This protocol illustrates how to explore, compare, and interpret human protein glycomes with online resources.
The Glyco@Expasy initiative was launched as a collection of interdependent databases and tools spanning several aspects of knowledge in glycobiology. In particular, it aims at highlighting interactions between glycoproteins (such as cell surface receptors) and carbohydrate-binding proteins mediated by glycans. Here, major resources of the collection are introduced through two illustrative examples centered on the N-glycome of the human Prostate Specific Antigen (PSA) and the O-glycome of human serum proteins. Through different database queries and with the help of visualization tools, this article shows how to explore and compare content in a continuum to gather and correlate otherwise scattered pieces of information. Collected data are destined to feed more elaborate scenarios of glycan function. Glycoinformatics introduced here is, therefore, proposed as a means to either strengthen, shape, or refute assumptions on the specificity of a protein glycome in a given context.
Glycans, the proteins to which they are attached (glycoproteins), and the proteins to which they bind (lectins or carbohydrate-binding proteins) are the main molecular actors at the cell surface1. Despite this central role in cell-cell communication, large-scale studies generating glycomics, glycoproteomics, or glycan-interactomics data are still scarce compared to their counterparts in genomics and proteomics.
Until recently, methods for characterizing the branching structures of complex carbohydrates while still being conjugated to the carrier protein had not been developed. The biosynthesis of glycoproteins is a non-template-driven process in which the monosaccharide donors, the accepting glycoprotein substrates, and the glycosyltransferases and glycosidases play an interactive role. The resulting glycoproteins can bear complex structures with multiple branching points where each monosaccharide component can be one of the several types present in nature1. The non-template-driven process imposes biochemical analysis as the only option for generating oligosaccharide structural data. The analytical process of glycan structures attached to a native protein is often challenging as it requires sensitive, quantitative, and robust technologies to determine monosaccharide composition, linkages, and branching sequences2.
In this context, mass spectrometry (MS) is the most widely used technique in glycomics and glycoproteomics experiments. Over time, these experiments are being carried out in higher-throughput settings, and data are now accumulating in databases. Glycan structures in various formats3 populate GlyTouCan4, the universal glycan data repository where each structure is associated with a stable identifier irrespective of the level of precision with which the glycan is defined (e.g., possibly missing linkage type or ambiguous composition). Very similar structures are collected but their minor differences are clearly reported. Glycoproteins are described and curated in GlyConnect5 and GlyGen6, two databases cross-referencing each other. MS data supporting structural pieces of evidence are increasingly stored in GlycoPOST7. For a wider coverage of online resources, chapter 52 of the reference manual, Essentials of Glycobiology, is dedicated to glycoinformatics8. Interestingly, glycopeptide identification software has proliferated in recent years9,10 though not to the benefit of reproducibility. The latter concern prompted the leaders of the HUPO GlycoProteomics Initiative (HGI) to set a software challenge in 2019. The MS data obtained from processing complex mixtures of N- and O-glycosylated human serum proteins in CID, ETD, and EThcD fragmentation modes were made available to competitors, whether software users or developers. The full report on the results of this challenge11 is only outlined here. To begin with, a spread of identifications was observed. It was mainly interpreted as caused by the diversity of methods implemented in search engines, of their settings, and of how outputs were filtered and peptides “counted”. The experimental design may also have put some software and approaches at a (dis)advantage. Importantly, participants using the same software reported inconsistent results, thereby highlighting serious reproducibility issues.
It was concluded by comparing different submissions that some software solutions perform better than others and some search strategies yield better results. This feedback is likely to guide the improvement of automated glycopeptide data analysis methods and will in turn, impact database content.
The expansion of glycoinformatics led to creating web portals that provide information and access to multiple similar or complementing resources. The most recent and up-to-date are described in a chapter of the Comprehensive Glycoscience book series12 and, through cooperation, a solution to data sharing and information exchange is offered in an open access mode. One such portal, originally called Glycomics@ExPASy13, was renamed Glyco@Expasy following the major overhaul of the Expasy platform14, which has hosted a large collection of tools and databases used across several -omics for decades, the most popular item being UniProt15, the universal protein knowledgebase. Glyco@Expasy offers a didactic discovery of the purpose and usage of databases and tools, based on a visual categorization and a display of their interdependencies. The following protocol illustrates procedures to explore glycomics and glycoproteomics data with a selection of resources from this portal that makes the connection between glycoproteomics and glycan-interactomics explicit via glycomics. As it is, glycomics experiments produce structures where monosaccharides are fully defined and linkages partially or fully determined, but their protein site attachment is poorly, if at all, characterized. In contrast, glycoproteomics experiments generate precise site attachment information but with a poor resolution of glycan structures, often limited to monosaccharide compositions. This information is pieced together in the GlyConnect database. Furthermore, search tools in GlyConnect can be used to detect potential glycan ligands which are described along with the proteins recognizing them in UniLectin16, linked to GlyConnect via glycans. The protocol presented here is divided into two sections to cover questions specific to N-linked and O-linked glycans and glycoproteins.
NOTE: A device with an Internet connection (larger screen preferred) and an up-to-date Web browser such as Chrome or Firefox is required. Using Safari or Edge may not be as reliable.
1. From a protein N-glycome in GlyConnect to a lectin of UniLectin
2. Exploring and comparing O-glycomes in GlyConnect
The first part of the protocol (section 1) showed how to investigate the specificity or the commonality of N-glycans attached on Asn-69 of the human Prostate Specific Antigen (PSA) using the GlyConnect platform. Tissue-dependent (urine and seminal fluid), as well as isoform-dependent (normal and high pI) variations in glycan expression, were emphasized using two visualizing tools (Figure 4 and Figure 5).
First, GlyConnect Octopus, which displays associations between entities stored in the database, provided the opportunity to explore contextual information via (1) selecting different entities to be shown in the Octopus and (2) clicking on links to examine related entries. The outcome was distinctive associations depending on the tissue.
Second, GlyConnect Compozitor, originally designed to define/refine a composition file for glycopeptide identification, was used to assess glycan expression in two known PSA isoforms (normal and high pI). The comparison of the two isoform glycomes produced a well-connected graph singling out four nodes (compositions), two of which are characteristic of the high pI isoform. Even though the glycome overlap is significant, the glycan property bar chart showed a drop in sialylation from the common to the high pI isoform (Supplementary Figure 5).
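The superimposition of two glycomes can be pictured as a simple set comparison. The following is a minimal sketch, not Compozitor's actual implementation: the function name `compare_glycomes` is hypothetical, and the compositions listed are illustrative placeholders, not the PSA data stored in GlyConnect.

```python
def compare_glycomes(first, second):
    """Split two composition lists into exclusive and shared compositions."""
    first, second = set(first), set(second)
    return {
        "first_only": sorted(first - second),   # would appear as blue nodes
        "shared": sorted(first & second),       # would appear as magenta nodes
        "second_only": sorted(second - first),  # would appear as red nodes
    }


# Hypothetical toy compositions in condensed notation (NOT real PSA data).
common_isoform = ["H5N4S2", "H5N4F1S2", "H5N4F1S1"]
high_pi_isoform = ["H5N4F1S1", "H5N4F1", "H5N4"]

result = compare_glycomes(common_isoform, high_pi_isoform)
for group, compositions in result.items():
    print(group, compositions)
```

The color comments follow the convention of the Compozitor output for two superimposed datasets (blue, red, and magenta for the overlap).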
Furthermore, the exploration of UniLectin3D singled out galectin-8 as a possible reader of the PSA glycome since the latter contains many structures with a NeuAc(a2-3)Gal(b1-4)GlcNAc terminal epitope. This provides a lead to follow but cannot be considered final evidence. Nonetheless, PSA and galectins are known to play an essential role in prostate cancer29 and the specific role of galectin-8 was recently highlighted30. The first part of the protocol correlates structural (glycoproteomics) and functional (binding) data to establish a likely scenario for protein-protein interactions mediated by glycans.
In the second part of the protocol (section 2), a high-quality set of O-glycan compositions associated with a particular tissue (human serum) was examined and compared to the GlyConnect database content, thereby offering the option of customizing a glycan composition file for the refined identification of glycopeptides (Figure 7 and Figure 8). It could rely on the minimal set of 20 compositions available from one dataset (HGI challenge results) or be enhanced with 23 to 26 items rationally collected in GlyConnect to strengthen the consistency of the set.
Color | red | light orange | green | light blue | purple | pink | dark blue | brown | dark orange |
Entity | species | protein | tissue source | structure | composition | disease | reference | glycosite | peptide |
Table 1: Color scheme associated with each entity of the GlyConnect database and valid throughout.
Figure 1: Dependency wheel of Glyco@Expasy instantiated for GlyConnect.
Figure 2: Suggested glycan structure for a selected glycan composition. Suggested glycan structure from a glycomics experiment for a glycan composition of a glycoproteomic experiment targeting the same glycoprotein, here human Prostate Specific Antigen (PSA), as proposed in the GlyConnect page for PSA (ID: 790).
Figure 3: Right-side menu of the GlyConnect page for PSA. Clickable cross-references to other major databases and display, with the LiteMol glycan plugin, of the existing 3D structure in the PDB.
Figure 4: The output of GlyConnect Octopus showing tissue-dependent associations between proteins and glycans. The query Hybrid AND Sialylated has returned all compositions matching these criteria and each composition links together the associated information about proteins and glycans as recorded in the database. Note that by default Species is set to Homo sapiens but this option is modifiable. Here, GlyConnect Octopus displays all human proteins (left nodes) carrying hybrid and sialylated glycan structures (right nodes) with the tissues in which they are expressed (center nodes). (A) The associations with urine are highlighted showing two proteins: choriogonadotropin (GLHA_HUMAN) and PSA common isoform (KLK3_HUMAN) connected to scattered (heterogeneous) glycan structures. (B) The associations with seminal fluid are highlighted showing two protein isoforms of PSA (KLK3_HUMAN) connected to grouped (similar) glycan structures.
Figure 5: The output of GlyConnect Compozitor showing the superimposed N-glycomes of the two isoforms of PSA. Compositions in condensed notation label each node. The glycans associated with the common isoform are represented as blue nodes and those of the high pI isoform as red nodes. The overlap between glycomes is shown as magenta nodes. Numbers inside the nodes represent the number of glycan structures matching the labeled composition according to the content of the GlyConnect database regarding PSA. The Compozitor graph shown has been slightly modified from the raw output to disentangle the network, which is generated by the D3.js library. This is easy to do as any node can be dragged anywhere in the browser window, and the paths can thus be shortened or stretched. Users can type a specific composition in the Zoom On field in the top-right corner to zoom in and center the graph on the corresponding node.
Figure 6: Summary entry of the human galectin-8 with NeuAc(a2-3)Gal(b1-4)GlcNAc binding details. Clicking on the green View the 3D Structure and Information button (indicated with a red ellipse) opens a new page in which a close-up on residue interactions is displayed with the PLIP application (indicated by a red arrow).
Figure 7: The output of GlyConnect Compozitor showing the O-glycome of the human serum high confidence dataset of the HGI challenge. Without virtual nodes (see text), the connectivity of that graph is low.
Figure 8: The output of GlyConnect Compozitor showing the possibility of completion of the O-glycome of the human serum high confidence dataset of the HGI challenge, using the GlyConnect database content. Accessing the content of the entire GlyConnect database using the Compozitor's Custom tab reveals that compositions corresponding to the virtual nodes are mapped with existing defined structures as highlighted in the node labels. The node size represents the number of references stored in the database and reporting the corresponding composition. The numeric label of nodes denotes the number of corresponding structures stored in GlyConnect. Selected compositions appear to have zero to eighteen possible matches in the database. In fact, these nodes are only virtual as a reflection of the content of experimental datasets. It is recommended to refine the information in the graph to test the realism of these additional nodes.
Supplementary Figure 1: Bubble chart of the Glyco@Expasy homepage. Zooming in the bubble chart of the Glyco@Expasy homepage to focus on the glycoprotein category. Software is shown in green bubbles and databases in yellow bubbles. Clicking on any bubble summarizes the purpose of the resource.
Supplementary Figure 2: Octopus-retrieved associations matching the query depending on composition. Default GlyConnect Octopus display of human proteins (left nodes) carrying hybrid and sialylated glycan structures (right nodes) with matching compositions (center nodes). Composition H6N4F12S1 appears unique to both PSA isoforms (KLK3_HUMAN). Clicking on the unique structure ID (10996) opens the corresponding page with details showing that the two isoforms are indeed the only proteins carrying this particular glycan.
Supplementary Figure 3: Octopus-retrieved associations matching the query depending on the disease. GlyConnect Octopus display of all human proteins (left nodes) carrying hybrid and sialylated glycan structures (right nodes) with the diseases in which they are expressed (center nodes). The associations with prostate cancer are highlighted showing the common isoform of PSA (KLK3_HUMAN).
Supplementary Figure 4: Octopus-retrieved associations matching the query depending on tissue information. GlyConnect Octopus display of all human proteins (left nodes) carrying bi-antennary glycan structures, including the NeuAc(a2-3)Gal(b1-4)GlcNAc motif (right nodes) with the tissues in which they are expressed (center nodes). The associations with seminal fluid are highlighted showing only the common isoform of PSA (KLK3_HUMAN) and seven structures.
Supplementary Figure 5: The output of GlyConnect Compozitor showing the superimposed N-glycomes of the two isoforms of PSA. Compositions in condensed notation label each node. The glycans associated with the common isoform are represented as blue nodes and those of the high pI isoform as red nodes. The overlap between glycomes is shown as magenta nodes. Numbers inside the nodes represent the number of glycan structures matching the labeled composition according to the content of the GlyConnect database regarding PSA. Mousing over the bar chart of glycan properties shows the correspondence between the frequency and the nodes as orange bubbles. Nearly all the PSA common isoform nodes are covered. This frequency drops in the high pI isoform.
Supplementary Figure 6: Glycan search interface in UniLectin3D. Clicking on the sialic acid SNFG symbol (circled in red) launches the search for all ligands that contain NeuAc, stored in UniLectin3D.
Supplementary Figure 7: Excerpt of the output of the search for all ligands that contain NeuAc. The NeuAc(a2-3)Gal(b1-4)GlcNAc motif of interest is circled in red.
Supplementary Figure 8: The output of GlyConnect Compozitor showing the O-glycome of the HGI dataset superimposed with the one in GlyConnect. The output of GlyConnect Compozitor showing the O-glycome of the human serum high confidence dataset of the HGI challenge in blue superimposed with the O-glycome of one O-glycosylated protein out of the 37 listed with the reference, i.e., inter-alpha-trypsin inhibitor heavy chain H4 with additional information contained in GlyConnect. This enhances the connectivity of the graph.
GlyConnect Octopus as a tool for revealing unexpected correlations
GlyConnect Octopus was originally designed to query the database with a loose definition of glycans. Indeed, the literature often reports the main characteristics of glycans in a glycome such as being fucosylated or sialylated, being made of two or more antennae, etc. Furthermore, glycans whether N- or O-linked are classified in cores, as detailed in the reference manual Essentials of Glycobiology1, that are also often cited in published articles. Finally, glycan epitopes such as blood group antigens are yet another property sought in structures and potentially singled out for typing a glycan. In the end, it may be relevant to search for common or distinct characteristics of a glycome expressed in a specific tissue or selected species. In that sense, the collected information should be used as a source of new assumptions as opposed to unique facts.
GlyConnect Compozitor as a tool for shaping a glycan composition set
Browsing structural information as described in a protein page has limitations because lists tend to obscure the relationships between itemized structures as well as those between compositions. GlyConnect Octopus attends to the former and GlyConnect Compozitor to the latter. A careful look at structures listed in most GlyConnect entries reveals the existence of common substructures. Yet this information is not easy to grasp visually without the help of a dedicated viewer.
Analysis of the HGI challenge results established that the content of the glycan composition file, which supports the identification of the glycan moiety, is a key parameter of glycopeptide identification software. Most classical proteomics search engines accommodate the selection of glyco-based modifications from a collection that derives from data collected in databases/repositories or the literature. Other glycoproteomics-dedicated tools use the knowledge of glycan biosynthesis. In this way, the composition file is theoretically defined as the result of expected enzymatic activity. In the end, there are as many composition files as there are search engines and the overlap between them is highly variable. Nonetheless, past experience in proteomics, especially when posttranslational modifications are accounted for, reveals that the performance of search engines is correlated with limiting the search space31. Similar observations are made in glycoproteomics and GlyConnect Compozitor was designed to support educated composition data selection, the importance of which was previously discussed32.
The usage of this tool was incompletely illustrated in the protocol especially regarding the Advanced tab in which queries that directly launch programmatic access to GlyConnect via its API (Application Programming Interface) can be expressed. For example, typing taxonomy=homo sapiens&glycanType=N-linked&tissue=urine&disease=prostate cancer in the query window of the Advanced tab is equivalent to filling the corresponding fields in the Source tab (selecting Homo sapiens in Species, Urine in Tissue, and N-linked in Glycan Type) and the Disease tab (selecting Homo sapiens in Species, Prostate cancer in Disease and N-linked in Glycan Type). In other words, it provides in one step a result that would require several selections.
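As an illustration, the query string quoted above can be assembled programmatically from its key-value pairs. This is a minimal sketch: the parameter names are those given in the text, the helper `build_advanced_query` is hypothetical, and no API endpoint is assumed here since none is specified in this article.

```python
from urllib.parse import quote, urlencode


def build_advanced_query(criteria):
    """Join key=value pairs with '&', as typed in the Advanced tab query window."""
    return "&".join(f"{key}={value}" for key, value in criteria.items())


# Criteria taken from the example in the text.
criteria = {
    "taxonomy": "homo sapiens",
    "glycanType": "N-linked",
    "tissue": "urine",
    "disease": "prostate cancer",
}

advanced_query = build_advanced_query(criteria)
print(advanced_query)

# Percent-encoded variant, only relevant if the string is embedded in a URL.
encoded_query = urlencode(criteria, quote_via=quote)
print(encoded_query)
```

Keeping the criteria in a dictionary makes it easy to add or drop a filter (for instance the disease) and regenerate the query without retyping it.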
Finally, while the creation of virtual nodes is explained in the protocol, their potential redundancy needs an additional comment. Two concurrent options may be indistinguishable because the simulated action of enzymes in the graph does not account for the chronology of enzyme activities. That is why Compozitor suggests two paths through two virtual nodes to bridge two unconnected nodes corresponding to monosaccharide counts with up to two differences. The inclusion of new data often provides missing links. The user is always free to consider or dismiss virtual nodes, by (un)ticking the Include Virtual Nodes box.
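The redundancy of virtual nodes can be made concrete with a small sketch. It assumes that two compositions are directly connected when they differ by a single monosaccharide and that the condensed labels H, N, F, and S stand for Hex, HexNAc, Fuc, and NeuAc; the function names are hypothetical and this is not Compozitor's actual code.

```python
import re

# Assumption: condensed notation limited to these four monosaccharide labels.
ORDER = ["H", "N", "F", "S"]


def parse_composition(label):
    """Turn a condensed label such as 'H5N4F1S1' into a dict of counts."""
    counts = {symbol: 0 for symbol in ORDER}
    for symbol, number in re.findall(r"([A-Z])(\d+)", label):
        counts[symbol] = int(number)
    return counts


def format_composition(counts):
    """Rebuild the condensed label, skipping monosaccharides with a zero count."""
    return "".join(f"{s}{counts[s]}" for s in ORDER if counts.get(s, 0) > 0)


def virtual_nodes(label_a, label_b):
    """Return intermediate compositions bridging two nodes that differ by two units."""
    a, b = parse_composition(label_a), parse_composition(label_b)
    steps = []  # unit additions (+1) or removals (-1) leading from a to b
    for symbol in ORDER:
        delta = b[symbol] - a[symbol]
        steps.extend([(symbol, 1 if delta > 0 else -1)] * abs(delta))
    if len(steps) != 2:
        return []  # directly connected, identical, or too far apart
    nodes = set()
    for symbol, sign in steps:
        intermediate = dict(a)
        intermediate[symbol] += sign  # apply only one of the two steps
        nodes.add(format_composition(intermediate))
    return sorted(nodes)


print(virtual_nodes("H5N4", "H5N4F1S1"))
print(virtual_nodes("H5N4", "H7N4"))
```

In the first call, the two steps (adding F and adding S) can be applied in either order, which yields two concurrent virtual nodes; in the second, both steps involve the same monosaccharide, so a single intermediate is produced.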
Known databases and software limitations
Overall, as with any navigation on the Web, the protocols described above occasionally lead to a non-existent page, often due to an update of a site or a conflict of updates between two sites. In this case and, in fact, in all cases where the navigation is not flowing, the easiest solution is to send a note to the Expasy helpdesk, whose efficiency has significantly contributed to the portal's success in the past 28 years.
The content of GlyConnect is biased as a reflection of the current imbalances in the literature. The majority of publications report N-glycosylation in mammals and the database is richer in human N-glycoproteins. Nonetheless, we have been asked in the past to include less common datasets and we remain completely open to receiving advice and suggestions.
Besides, Compozitor is currently limited to the comparison of three composition datasets. A major revision of the Determinant subtab in the Octopus is planned. Resources of Glyco@Expasy need regular updates and some may not be carried out in due course; nonetheless, warnings and/or announcements are published when it so happens.
Partner portals, known as GlyGen (https://www.glygen.org) and GlyCosmos (https://www.glycosmos.org), provide different options and tools. Ultimately, browsing and searching information on any of these portals entails a high level of subjectivity and largely depends on users’ habits and concerns. We can only hope that our solution suits a part of the community.
The input of glycoscience is growing in life science projects and studies establishing the role of glycans in health issues are continuously produced. The recent focus on SARS-CoV-2 revealed yet again the importance of glycosylated proteins, especially in structural approaches33. Glycoinformatics supports glycoscientists in daily tasks of data analysis and interpretation.
The authors have nothing to disclose.
The author warmly acknowledges past and present members of the Proteome Informatics Group involved in developing the resources used in this tutorial, specifically, Julien Mariethoz and Catherine Hayes for GlyConnect, François Bonnardel for UniLectin, Davide Alocci and Frederic Nikitin for the Octopus, and Thibault Robin for Compozitor and the final touch on Octopus.
The development of the Glyco@Expasy project is supported by the Swiss Federal Government through the State Secretariat for Education, Research and Innovation (SERI) and is currently complemented by the Swiss National Science Foundation (SNSF: 31003A_179249). ExPASy is maintained by the Swiss Institute of Bioinformatics and hosted at the Vital-IT Competency Center. The author also acknowledges Anne Imberty for outstanding cooperation on the UniLectin platform jointly supported by ANR PIA Glyco@Alps (ANR-15-IDEX-02), Alliance Campus Rhodanien Co-funds (http://campusrhodanien.unige-cofunds.ch), and Labex Arcane/CBH-EUR-GS (ANR-17-EURE-0003).
internet connection | user's choice |
recent version of web browser | user's choice |