A Fast and Quantitative Method for Post-translational Modification and Variant Enabled Mapping of Peptides to Genomes

Cross-talk between genes, transcripts, and proteins is the key to cellular responses; hence, analysis of molecular levels as distinct entities is slowly being extended to integrative studies to enhance the understanding of molecular dynamics within cells. Current tools for the visualization and integration of proteomics with other omics datasets are inadequate for large-scale studies. Furthermore, they only capture basic sequence identity, discarding post-translational modifications and quantitation. To address these issues, we developed PoGo to map peptides with associated post-translational modifications and quantification to reference genome annotation. In addition, the tool was developed to enable the mapping of peptides identified from customized sequence databases incorporating single amino acid variants. While PoGo is a command line tool, the graphical interface PoGoGUI enables non-bioinformatics researchers to easily map peptides to 25 species supported by Ensembl genome annotation. The generated output borrows file formats from the genomics field and, therefore, visualization is supported in most genome browsers. For large-scale studies, PoGo is supported by TrackHubGenerator to create web-accessible repositories of data mapped to genomes that also enable easy sharing of proteogenomics data. With little effort, this tool can map millions of peptides to reference genomes within only a few minutes, outperforming other available sequence-identity based tools. This protocol demonstrates the best approaches for proteogenomics mapping through PoGo with publicly available datasets of quantitative and phosphoproteomics, as well as large-scale studies.


Introduction
In cells, the genome, transcriptome, and proteome affect each other to modulate a response to internal and external stimuli and interact with each other to carry out specific functions leading to health and disease. Therefore, characterizing and quantifying genes, transcripts, and proteins is crucial for fully understanding cellular processes. Next-generation sequencing (NGS) is one of the most commonly applied strategies to identify and quantify gene and transcript expression. Protein expression, however, is commonly assessed by mass spectrometry (MS). Significant advancements in MS technology over the last decade have enabled a more complete identification and quantification of proteomes, making the data comparable with transcriptomics 1 . Proteogenomics and multi-omics as ways to integrate NGS and MS data have become powerful approaches to assess cellular processes across multiple molecular levels, identifying subtypes of cancer and leading to novel potential drug targets in cancer 2,3 . It is important to note that proteogenomics was initially used to provide proteomic evidence for gene and transcript annotations 4 . Several genes previously thought to be non-coding have recently undergone reevaluation considering large-scale human tissue datasets 5,6,7 . In addition, proteomic data are successfully used to support annotation efforts in non-model organisms 8,9 . However, proteogenomic data integration can be exploited further to highlight protein expression in relation to genomic features and elucidate cross-talk between transcripts and proteins by providing a combined reference system and methods for co-visualization.
In order to provide a common reference for proteomics, transcriptomics, and genomics data, numerous tools have been implemented for mapping peptides identified through MS onto genome coordinates 10,11,12,13,14,15,16,17 . Approaches differ in aspects such as mapping reference, support of genome browsers, and degree of integration with other proteomics tools as shown in Figure 1. While some tools map reverse translated peptides onto a genome 16 , others use a search engine annotated position within a protein and gene annotation to reconstruct the nucleotide sequence of the peptide 15 . Still others use a 3- or 6-frame translation of the genome to map peptides against 11,13 . Lastly, several tools skip the nucleotide sequences and use amino acid sequence translations from RNA-sequencing mapped transcripts as an intermediate to map peptides to the associated genome coordinates.

Mapping Peptides with Annotated Post-translational Modifications and Visualization Including Quantitation
NOTE: The resulting output file can be loaded in any genome browser supporting the Browser Extensible Data (BED) format. Suitable browsers include the Integrative Genomics Viewer (IGV) 24 (which is used in the following), the UCSC Genome Browser 25 , and the Ensembl Genome Browser 20 . It is important that the annotation GTF and protein FASTA versions used for PoGo mapping match the version of the genome in the genome browser. For the human Ensembl releases 57-75 and GENCODE versions 3d-19, use GRCh37/hg19; for the Ensembl versions 76 or higher and GENCODE 20 or higher, use GRCh38/hg38. For the mouse Ensembl versions 74 or higher and GENCODE M2 or higher, use GRCm38.
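The version rule in the NOTE above can be written as a small lookup. The following is a minimal sketch in Python, not part of PoGo; the function name assembly_for is hypothetical, and only the Ensembl release ranges stated in the NOTE are encoded. Releases published after this protocol may use newer assemblies that are not covered here.

```python
def assembly_for(species: str, ensembl_release: int) -> str:
    """Return the genome assembly to select in the browser.

    Encodes only the Ensembl ranges given in the NOTE; anything else
    raises an error instead of guessing.
    """
    key = species.strip().lower()
    if key == "human":
        if 57 <= ensembl_release <= 75:
            return "GRCh37/hg19"
        if ensembl_release >= 76:
            return "GRCh38/hg38"
    elif key == "mouse":
        if ensembl_release >= 74:
            return "GRCm38"
    raise ValueError(
        f"No assembly rule for species={species!r}, release={ensembl_release}"
    )


if __name__ == "__main__":
    # Example: annotation files downloaded from Ensembl release 75 (human)
    print(assembly_for("Human", 75))
```

Raising an error for uncovered combinations is a deliberate choice in this sketch, since a silently wrong assembly would shift every mapped peptide in the browser.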
1. Navigate to the executables folder. Start the program by double-clicking the icon PoGoGUI-vX.X.X.jar. NOTE: The graphical user interface will start up and allow easy and visual selection of options.
2. Use the Select button next to the "PoGo Executable". Then, navigate in the executables folder to the relevant operating system's subfolder (e.g., C:\PoGo\Executables\Windows\). Select the executable of PoGo (e.g., PoGo.exe) and confirm its selection by clicking the Open button.
3. Select the reference input file for protein sequences by clicking Select. Navigate to the data folder and select the translation FASTA file. Confirm its selection by clicking the Open button.
4. Select the transcript annotation file using the Select button. Navigate to the data folder and select the annotation GTF file. Confirm the selection by clicking the Open button.
5. Add the peptide identification file (multiple file selection is enabled) by using the Add button next to "Peptide Files". Select a file in the supported format mzTab, mzIdentML, or mzid, or in the tab-separated 4-column format downloaded and prepared in step 1.3.
6. Untick the checkboxes next to BED and GTF in the output formats selection. Only leave PTM BED and GCT checked.
7. Select the appropriate species for the data from the drop-down selection. It is essential that the FASTA file, the GTF file, and the dropdown selection are for the same species.
8. Start mapping by clicking the START button.
NOTE: If necessary, PoGoGUI will convert the input file into pogo format, provide the pogo files in the same folder for future convenience, and start the mapping process. The conversion of a single mzTab file downloaded in step 1.3.1 will take 10-20 min before mapping commences.
NOTE: Due to their size, some files may require the generation of an index to allow a quick reloading of the genomic regions. IGV will automatically prompt the user to generate it. Follow the instructions indicated.
2. Repeat the loading step for the file ending in "_noptm.bed". This file contains all peptides found without any modification.
3. Note that each loaded file will be shown as a separate track with the file name identifying the track. Reorder tracks by dragging and dropping them to the desired position in the list.
4. Note that each track is initially shown in a collapsed manner. To expand a track, right click on the track name and select either expanded for a full view of the peptides including the sequences or squished for a stacked view.
5. Repeat the loading step for the file ending in ".gct". This file contains the peptide quantitation per annotated sample.
6. Unlike for the files loaded above, each annotated sample will be loaded as a separate track. Reorganize the samples through drag and drop operations.
7. Navigate within the genome by selecting a chromosome in the drop-down menu, typing in genomic coordinates, searching for a gene symbol, or clicking and holding to select a section of a chromosome to zoom in.

Mapping Peptides Identified Through a Custom Variant Database to a Reference Genome
NOTE: PoGo mapping can be carried out using the graphical user interface (GUI) or through the command line interface. The two are interchangeable. In this part of the protocol, the command line interface is used to highlight this interchangeability. The second part of this protocol section requires the software tool R 26 . Please ensure that R is installed.
1. Map the reference peptides to the reference genome.
1. Open a command prompt (cmd) and navigate to the executables folder of PoGo (e.g., C:\PoGo\Executables\).

2. Type the command below:
PoGo.exe -gtf \PATH\TO\GTF -fasta \PATH\TO\FASTA -in \PATH\TO\IN -format BED -species MYSPECIES
1. Substitute the \PATH\TO\GTF, \PATH\TO\FASTA, and \PATH\TO\IN with paths to the annotation GTF, protein sequence FASTA, and peptide identification file (in the 4-column format with file ending ".tsv" or ".pogo"), respectively. Also substitute MYSPECIES with the species consistent with the data (e.g., Human).
3. Confirm the execution by pressing the "Enter" key. Wait until the execution is finished before progressing any further.
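For scripted or repeated runs, the command in the step above can be assembled programmatically. The following Python sketch uses only the options shown in that command (-gtf, -fasta, -in, -format, -species); the helper names build_pogo_command and run_pogo and all file paths are hypothetical placeholders, and the actual call is left commented out so that nothing is executed by accident.

```python
import subprocess


def build_pogo_command(executable, gtf, fasta, peptides, species, out_format="BED"):
    """Assemble the PoGo call as an argument list (no shell quoting needed)."""
    return [
        str(executable),
        "-gtf", str(gtf),
        "-fasta", str(fasta),
        "-in", str(peptides),
        "-format", out_format,
        "-species", species,
    ]


def run_pogo(*args, **kwargs):
    """Run PoGo and block until it finishes; raises if PoGo exits with an error."""
    command = build_pogo_command(*args, **kwargs)
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Placeholder paths: substitute the real GTF, FASTA, and peptide files.
    cmd = build_pogo_command(
        "PoGo.exe", "annotation.gtf", "translations.fa", "peptides.pogo", "Human"
    )
    print(" ".join(cmd))
    # run_pogo("PoGo.exe", "annotation.gtf", "translations.fa", "peptides.pogo", "Human")
```

Because subprocess.run waits for the process to end, the next protocol step only starts once the mapping has finished, which mirrors the instruction to wait before progressing.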
visualization, is shown in Figure 5. Shotgun proteomics (i.e., the proteolytic digestion of proteins followed by liquid chromatography coupled with tandem mass spectrometry) is one precursory step of proteogenomic mapping. The resulting tandem mass spectra are commonly compared to theoretical spectra derived from protein sequence databases. Proteogenomics studies introduce translation sequences of novel transcripts with coding potential and non-synonymous single nucleotide variants (SNVs) into the database, making it hard to relate these back to the reference genome 8 . The graphical user interface of PoGo (PoGoGUI) supports file formats for the standardized reporting of peptide identifications from mass spectrometry experiments and converts them into the simplified 4-column pogo format. PoGoGUI wraps the command line tool PoGo and thus enables the mapping of peptides onto genome coordinates utilizing the reference annotation of protein-coding genes commonly provided in GTF format and the translated transcript sequences in FASTA format. Different output formats are generated by PoGo to enable the visualization of different aspects of the peptides identified through mass spectrometry, including post-translational modifications and peptide level quantification. Output files in BED format can further be converted and combined into online accessible directories called track hubs. Single output files, as well as track hubs, can then be visualized in browsers such as the UCSC Genome Browser 25 , Ensembl Genome Browser 20 , IGV 24 , and Biodalliance 28 (see Figure 5 bottom).
We applied PoGo to the reanalysis of the draft human proteome maps filtered at high significance as described in Wright et al. 7 and compared it to two other tools for proteogenomic mapping, namely iPiG 14 and PGx 10 . The dataset comprised 233,055 unique peptides across 59 adult and fetal tissues resulting in a total of over 3 million sequences. PoGo outperformed these tools both in runtime (6.9x and 96.4x faster, respectively) and memory usage (20% and 60% less memory, respectively) as shown in Figure 6 18 . An example of a successfully mapped peptide is shown in Figure 7.
While PoGo significantly outperformed the other tools in speed and memory, it is also capable of mapping post-translational modifications and quantitative information associated with peptides onto the genome. Figure 8A schematically depicts the visualization of the BED format in a genome browser for peptides mapping to one exon and across splice junctions. PoGo utilizes the coloring option to provide an easy visual aid with respect to the uniqueness of the peptide mapping within the genome. Mappings in red indicate uniqueness to a single transcript, while black highlights a mapping to a single gene in which the peptide is shared between different transcripts. Grey mappings show a peptide shared between multiple genes. Such peptides are, for example, less reliable for the quantification of a gene and untrustworthy for calling the expression of a gene. The PTM BED option of PoGo redefines the color code to accommodate different types of post-translational modifications as shown in Figure 8B. Additionally, PTMs are indicated by thick blocks (see Figure 8B). A single PTM of a type is highlighted by a thick block at the position of the modified amino acid residue, while multiple PTMs of the same type are spanned by a thick block from the first modified amino acid to the last.
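The color and thick-block conventions described above can be illustrated with a short Python sketch that writes a 12-column BED line. This is not PoGo's implementation: the function names, the RGB values chosen for red, black, and grey, the peptide name PEPTK, and the coordinates are all assumptions for illustration, and the thick-span helper handles forward-strand peptides only.

```python
# Assumed RGB values; the text only names the colors red, black, and grey.
RED, BLACK, GREY = "255,0,0", "0,0,0", "128,128,128"


def uniqueness_color(n_genes, n_transcripts):
    """Color by mapping uniqueness: one transcript, one gene, or several genes."""
    if n_genes < 1 or n_transcripts < 1:
        raise ValueError("a mapped peptide needs at least one gene and transcript")
    if n_genes > 1:
        return GREY   # shared between multiple genes
    if n_transcripts > 1:
        return BLACK  # one gene, shared between its transcripts
    return RED        # unique to a single transcript


def genomic_pos(blocks, offset):
    """Map a 0-based nucleotide offset in the peptide to a genomic position."""
    for start, end in blocks:
        size = end - start
        if offset < size:
            return start + offset
        offset -= size
    raise ValueError("offset lies outside the peptide")


def ptm_thick_span(blocks, modified_residues):
    """Thick region from the first to the last modified residue (0-based)."""
    first, last = min(modified_residues), max(modified_residues)
    thick_start = genomic_pos(blocks, 3 * first)
    thick_end = genomic_pos(blocks, 3 * last + 2) + 1  # BED ends are exclusive
    return thick_start, thick_end


def bed12_line(chrom, blocks, name, strand, rgb, thick=None, score=0):
    """Build one BED12 line; blocks are sorted 0-based half-open exon parts."""
    chrom_start, chrom_end = blocks[0][0], blocks[-1][1]
    # Choice made in this sketch: without PTMs the whole feature is drawn thick.
    thick_start, thick_end = thick if thick else (chrom_start, chrom_end)
    sizes = ",".join(str(end - start) for start, end in blocks)
    starts = ",".join(str(start - chrom_start) for start, _ in blocks)
    fields = [chrom, chrom_start, chrom_end, name, score, strand,
              thick_start, thick_end, rgb, len(blocks), sizes, starts]
    return "\t".join(str(field) for field in fields)


if __name__ == "__main__":
    # A 5-residue peptide (15 nt) spanning a splice junction: 6 nt + 9 nt.
    blocks = [(100, 106), (200, 209)]
    print(bed12_line("chr1", blocks, "PEPTK", "+", uniqueness_color(1, 1)))
    print(ptm_thick_span(blocks, [1, 3]))
```

In the example, the two blocks are joined by a thin line across the intron in a genome browser, and the thick span for residues 1 and 3 runs from the second codon in the first exon into the second exon. The PTM-specific colors of the PTM BED option are not reproduced here because the text does not list them.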
We applied PoGo and subsequently TrackHubGenerator to a dataset of 50 colorectal cancer cell lines including whole proteome and phosphoproteome 29 . While the track hub loaded in the UCSC Genome Browser shows the peptides mapped to the genome and highlights the uniqueness of the mappings and the phosphorylation sites (see Figure 9), additional data are provided in the supplemental folder. The GCT files then enable the visualization of the peptide and phosphopeptide quantitation in a genomic context. However, GCT files do not provide an easy visualization of peptides spanning splice junctions (see Figure 10 top). The peptides across splice junctions are split into their respective parts mapping to the exons. While it is possible to identify splice peptides through the same quantitative values of exon mappings, loading sequence-based mapping files such as BED or GTF, which connect the exons by a thin intron-spanning line, supports the interpretation (see Figure 10 bottom).
To highlight the utility of variant-enabled mapping, we applied PoGo in two configurations to a dataset of the human testis proteome that was searched against neXtProt to hunt for missing proteins using a multi-enzyme strategy 22 . In addition to reference protein sequences, neXtProt comprises over 5 million single amino acid variants 30 . Mapping peptides identified with a single amino acid variant is not supported by other mapping tools. A total of 177,012 unique peptides were identified. Of these, 176,694 peptides (99.8%) were first successfully mapped without allowing mismatches. Removing those from the identified peptide list left 318 peptides (0.2%), which were subsequently mapped allowing one amino acid substitution. This resulted in 3,446 mappings of 162 peptides that would not have been mapped to the reference genome with any other available tool. While the average number of mappings including a mismatch is high, 62 peptides were mapped to only a single locus, indicating true variant sequences. An example of a peptide mapped with a single amino acid substitution is highlighted with its sequence and the translated genomic sequence in Figure 11.
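The counts reported in the paragraph above can be checked with a few lines of arithmetic. The Python sketch below uses only the numbers given in the text; the variable names are illustrative.

```python
total_peptides = 177_012      # unique peptides identified
exact_mapped = 176_694        # mapped without mismatches
variant_peptides = 162        # mapped when one substitution is allowed
variant_mappings = 3_446      # mappings produced by those peptides

# Peptides left over after the exact-match stage
remaining = total_peptides - exact_mapped

# Shares of the full peptide list, in percent
pct_exact = 100 * exact_mapped / total_peptides
pct_remaining = 100 * remaining / total_peptides

# Average number of loci per peptide mapped with one mismatch
mean_mappings = variant_mappings / variant_peptides

if __name__ == "__main__":
    print(f"remaining after exact mapping: {remaining}")
    print(f"exact: {pct_exact:.1f}%  remaining: {pct_remaining:.1f}%")
    print(f"mean mappings per variant peptide: {mean_mappings:.1f}")
```

An average of roughly 21 mappings per peptide is what the text refers to as high, which is why the 62 peptides with a single locus stand out as the more trustworthy variant candidates.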

Discussion
This protocol describes how the software tool PoGo and its graphical user interface PoGoGUI enable a fast mapping of peptides onto genome coordinates. The tool offers unique features such as quantitative, post-translational modification and variant-enabled mapping to genomes using reference annotation. This article demonstrates the method on a large-scale proteogenomic study and highlights its speed and memory efficiency compared to other available tools 18 . In combination with the tool TrackHubGenerator, which creates online accessible hubs of genomic and genome linked data, PoGo, with its graphical user interface, enables large-scale proteogenomics studies to quickly visualize their data in genomic context. Furthermore, we demonstrate the unique features of PoGo with datasets searched against variant databases and quantitative phosphoproteomics 22,29 .
Single files, such as the GCT file, provide valuable visualization and links between peptide features and genomic loci. However, it is important to note that an interpretation based on these alone may be difficult or misleading due to their limitation to single aspects of proteogenomics such as uniqueness, post-translational modifications, and quantitative values. Therefore, it is important to carefully choose which output files, options, and combinations are appropriate for the proteogenomic question at hand and modify the combinations. For example, information about the . The Output should be generated by PoGo for each setting. In case no output is generated, or empty files are shown in the output folder, it is recommended to check the input files for the desired content and the required file format. In cases where the file format or content does not follow the expectations of PoGo (e.g., the FASTA file supposedly containing the transcript translation sequences contains the nucleotide sequences of the transcripts), error messages will ask the user to check the input files.
Restrictions of the protocol and the tool are mostly based on the reuse of file formats commonly used in genomics. Repurposing file formats used in genomics for proteogenomic applications is accompanied by specific limitations. These are due to the differing sets of requirements for genome centered visualization of genomic and proteogenomic data, such as the need to visualize post-translational modifications from proteomics data. This is restricted in the genomics file formats by single feature usage. Many approaches and tools have been developed for proteomics to confidently localize post-translational modifications within peptide sequences 31,32,33,34 . However, the visualization of multiple modifications in a unique and discernable manner on the genome is hindered by the structure of genomic file formats. Therefore, the single block visualization of multiple PTMs of the same type does not imply any ambiguity of the modification sites but is the consequence of the differing requirement from the genomics community to only visualize single features at a time. Nonetheless, PoGo has the advantage of mapping post-translational modifications onto genomic coordinates to enable studies focused on the effect of genomic features such as single nucleotide variants on post-translational modifications. Using PoGo, variant mapping increases the number of total mappings. However, the color coding of mapped peptides distinguishes reliable mappings from unreliable ones. The mapping of variant peptides identified from known single nucleotide variants can be accompanied by visualizing the mapped peptides alongside the variants in VCF format. This way, the color code indicating an unreliable mapping of a variant peptide is overruled by the presence of the known nucleotide variant.
A critical step for using PoGo is the use of the correct files and formats. The main criterion is that translated transcript sequences are used as the protein sequences accompanying the annotation in GTF format. Another critical element when using PoGo to map peptides with amino acid mismatches is memory. While PoGo is highly memory-efficient for a standard application, the exponentially increasing number of possible mappings with one or two mismatches leads to a similarly exponential increase in memory usage 18 . We propose a staged mapping as described in this protocol: first map the peptides without mismatches and remove them from the set. The remaining unmapped peptides can then be mapped allowing one mismatch, and the procedure can be repeated with two mismatches for the peptides that still remain unmapped.
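The staged strategy can be sketched in Python as follows. The mapper here is a toy stand-in that slides each peptide along hypothetical protein sequences and counts mismatches; it is not PoGo's algorithm, and the names find_matches and staged_mapping as well as the example sequences are assumptions for illustration. In practice each stage would be a separate PoGo run on the peptides left over from the previous stage.

```python
def find_matches(peptide, proteins, max_mismatches):
    """Toy mapper: report every window with at most max_mismatches substitutions."""
    hits = []
    length = len(peptide)
    for protein_id, sequence in proteins.items():
        for start in range(len(sequence) - length + 1):
            window = sequence[start:start + length]
            mismatches = sum(a != b for a, b in zip(peptide, window))
            if mismatches <= max_mismatches:
                hits.append((protein_id, start, mismatches))
    return hits


def staged_mapping(peptides, proteins, max_stage=2):
    """Map with 0 mismatches first, then retry only the unmapped peptides."""
    remaining = list(dict.fromkeys(peptides))  # unique peptides, order kept
    results = {}
    for allowed in range(max_stage + 1):
        still_unmapped = []
        for peptide in remaining:
            hits = find_matches(peptide, proteins, allowed)
            if hits:
                results[peptide] = (allowed, hits)  # stage at which it mapped
            else:
                still_unmapped.append(peptide)
        remaining = still_unmapped  # only these go to the next, costlier stage
    return results, remaining


if __name__ == "__main__":
    proteins = {"P1": "MKTAYIAKQR", "P2": "MSTAYLAKQR"}  # hypothetical sequences
    peptides = ["TAYIAK", "TAYVAK", "WWWWWW"]
    mapped, unmapped = staged_mapping(peptides, proteins)
    for peptide, (stage, hits) in mapped.items():
        print(peptide, "mapped with", stage, "mismatch(es):", hits)
    print("unmapped:", unmapped)
```

Because exactly matching peptides are removed before mismatches are allowed, the expensive mismatch search only sees the small remainder, which is the point of the staged procedure.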
Since the throughput of mass spectrometry has significantly increased and studies interfacing genomic and proteomic data have become more frequent in recent years, tools that readily enable interfacing these types of data in the same coordinate system are increasingly indispensable. The tool presented here will help meet the need to combine genomic and proteomic data and improve the understanding gained from integrative studies across small and large datasets by mapping peptides onto a reference annotation. Encouragingly, PoGo has been applied to map peptides to gene candidates provided in the same format as the reference annotation to support annotation efforts of novel genes expressed in human testis 35 . The approach presented here is independent of databases used for peptide identification. The protocol might aid in the identification and visualization of novel translation products by using adapted input files from translation sequences and associated GTF files from RNA-seq experiments.
Several approaches and tools with a wide range of special application scenarios to map peptides to genomic coordinates, ranging from mapping peptides directly to the genome sequence to RNA-sequencing guided mapping, have been introduced 10,11,12,13,14,15,16,17 . However, these can fail to correctly map peptides when post-translational modifications are present, and errors in the underlying mapping of RNA-sequencing reads may be propagated to the peptide level. PoGo has been developed specifically to overcome those obstacles and to cope with the rapid increase of quantitative high-resolution proteomic datasets to integrate with orthogonal genomics platforms. The tool described here can be integrated into high-throughput workflows. Through the graphical interface PoGoGUI, the tool is simple to use and requires no specialist bioinformatics training.

Disclosures
The authors have nothing to disclose.