These representative results were obtained by following the procedure outlined in this protocol. A text mining association analysis was performed following the CaseOLAP LIFT protocol5 with default parameters, studying eight broad categories of cardiovascular diseases72 and their association with mitochondrial proteins (GO:0005739). In total, 635,696 publications through May 2024 were identified as relevant to these diseases; among them, 4,655 high-confidence protein-disease associations were identified to inform downstream analyses. A biomedical knowledge graph was constructed in May 2024 using the software code from Know2BIO9 with default settings. The resulting knowledge graph consists of 219,450 nodes and 6,323,257 edges, with node features for 189,493 nodes (node descriptions, protein/gene sequences, chemical structures, etc., where available). An estimate of computational time for all steps in the protocol is presented in Table 1.
The RUGGED system was initialized by constructing vector databases for the knowledge graph nodes and features and for the CVD-relevant publications. All knowledge graph nodes, edges, and node features were processed with a chunk size of 20 tokens using the BART71 embedding model to prepare for RAG vector search. Similarly, original contributions and review articles were processed with a chunk size of 500 tokens using the BART embedding model. For literature retrieval, full-text publications longer than 500 tokens were summarized hierarchically, section by section, by the BART embedding model. The GPT-4o model was used for the remaining LLM agents in the system.
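The fixed-size chunking described above can be illustrated with a minimal sketch. It assumes whitespace tokenization, which is a simplification: the protocol's embedding model uses its own tokenizer, and the function name `chunk_tokens` is hypothetical rather than part of the RUGGED code.

```python
from typing import List

# Chunk sizes stated in the text; the tokenization scheme is an assumption.
KG_CHUNK_SIZE = 20
TEXT_CHUNK_SIZE = 500


def chunk_tokens(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size whitespace tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    tokens = text.split()
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]


# Toy node description with 45 tokens: yields chunks of 20, 20, and 5 tokens.
node_description = " ".join(f"w{i}" for i in range(45))
kg_chunks = chunk_tokens(node_description, KG_CHUNK_SIZE)
print([len(chunk.split()) for chunk in kg_chunks])  # [20, 20, 5]
```

Each chunk would then be embedded and stored in the vector database; that step depends on the embedding library used and is omitted here.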
These representative results showcase an example use case to investigate potential drug therapeutics for Arrhythmogenic Cardiomyopathy (ACM) and Dilated Cardiomyopathy (DCM), identified as MeSH_Disease: D019571 and MeSH_Disease: D002311, respectively. A series of inquiries is outlined in Figure 3, with highlighted examples of model responses shown in Figure 4 and the full responses reported in Supplementary File 1, Section A. The direction of inquiry was adapted to the investigator-validated responses: each subsequent query was crafted based on the results of the previous responses. The analysis revealed 11 drug candidates classified under beta blockers and antiarrhythmics. Novel avenues for therapeutic treatment were assessed using a Graph Convolutional Neural Network link prediction model on a subset of the complete knowledge graph, including nodes within 1-hop from study disease and drug nodes and their interconnections, with evaluation metrics reported in Table 4. For each prediction, the top 10 relevant edges were further examined with a graph explainability module, GNNExplainer44, to identify the nodes and edges contributing most to that prediction. The total cost of using a commercial LLM for all steps of the RUGGED protocol for this use case is estimated at $1.50 at the time of writing.
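The 1-hop subgraph selection mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the protocol's implementation: edges are treated as undirected when collecting neighbors, the node labels are toy placeholders, and `one_hop_subgraph` is a hypothetical helper name.

```python
from typing import Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def one_hop_subgraph(edges: Iterable[Edge], seeds: Set[str]) -> Tuple[Set[str], List[Edge]]:
    """Return the seed nodes plus their direct neighbors, and every edge
    whose two endpoints both fall inside that node set."""
    edge_list = list(edges)
    nodes = set(seeds)
    for u, v in edge_list:
        # Assumption: direction is ignored when collecting neighbors.
        if u in seeds:
            nodes.add(v)
        if v in seeds:
            nodes.add(u)
    kept = [(u, v) for u, v in edge_list if u in nodes and v in nodes]
    return nodes, kept


# Toy graph with placeholder labels (not real knowledge graph identifiers).
toy_edges = [
    ("disease", "gene_a"),
    ("drug", "gene_b"),
    ("gene_a", "gene_b"),   # interconnection between two 1-hop neighbors
    ("gene_b", "far"),      # "far" is 2 hops from the seeds, so it is dropped
    ("far", "other"),
]
sub_nodes, sub_edges = one_hop_subgraph(toy_edges, {"disease", "drug"})
print(sorted(sub_nodes))  # ['disease', 'drug', 'gene_a', 'gene_b']
print(sub_edges)
```

Keeping the edges among neighbors, and not only the edges touching a seed, is what the phrase "and their interconnections" implies; the link prediction model is then trained on this reduced edge list.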

Figure 1: Retrieval Under Graph-Guided Explainable disease Distinction (RUGGED) workflow. RUGGED consists of five primary components: (1) assembling and processing data from ethically sourced and professionally managed resources (e.g., PubMed and curated biomedical knowledge bases), (2) integrating peer-reviewed research findings into a unified knowledge graph, (3) structuring the text and graph data within database services, (4) modeling and predicting explainable relationships among biomedical entities within the knowledge graph, and (5) retrieving and synthesizing knowledge through a Retrieval Augmented Generation (RAG) workflow (Figure 2) to validate complex molecular relationships and explore AI-driven disease predictions. A human-in-the-loop review step can be conducted by the user to enhance the accuracy of the output.

Figure 2: Retrieval architecture and bias mitigation workflow. The Retrieval Augmented Generation (RAG) framework employs multiple LLM agents, each executing specific tasks to support access to relevant information based on the user query. This system provides documented evidence for the user-facing GPT-based Reasoning Agent, facilitating user-agent interaction and synthesis of knowledge. (1) Biomedical Text Retrieval: Peer-reviewed original contributions and review articles are filtered based on their relevance to understanding disease associations. A vector database is constructed for author- and editor-validated text evidence, weighted by the section of the publication in which it appears: 70% Abstract, 10% Results, 10% Metadata, and 10% for all other subsections. A keyword search and a similarity search against the text embedding of the user query together identify relevant documents. A summary of each document is generated using a BERT-based summarizer, with the GPT-based Text Evaluator Agent refining the search to validate query-document relevance. (2) Knowledge Graph Retrieval: A BERT-based named entity recognition and GPT-based relation extraction module connects the user query to relevant entities in the knowledge graph. A similarity search in a vector database identifies pertinent nodes and edges. Data is retrieved from the Neo4j database via Cypher queries generated by the GPT-based Cypher Query Agent and refined by the Query Verification Agent. (3) The individual responses from the Biomedical Text Retrieval or Knowledge Graph Retrieval pipelines are presented to the Reasoning Agent, which synthesizes a concise response to the user's query with minimal bias. This system is guided to maintain accuracy and impartiality in presenting factual information.
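One way to read the section weighting in the Figure 2 legend is as a weighted sum of per-section similarity scores. The sketch below assumes that interpretation; the weights come from the legend, but the averaging of "other" sections, the treatment of missing sections as zero, and the function name `weighted_relevance` are assumptions made for illustration.

```python
from typing import Dict

# Weights from the Figure 2 legend.
SECTION_WEIGHTS = {"abstract": 0.70, "results": 0.10, "metadata": 0.10, "other": 0.10}
NAMED_SECTIONS = ("abstract", "results", "metadata")


def weighted_relevance(section_scores: Dict[str, float]) -> float:
    """Combine per-section query similarity scores into one document score.

    Assumptions: missing named sections contribute 0, and all remaining
    sections are averaged before the 10% 'other' weight is applied.
    """
    score = sum(
        SECTION_WEIGHTS[name] * section_scores.get(name, 0.0)
        for name in NAMED_SECTIONS
    )
    others = [v for k, v in section_scores.items() if k not in NAMED_SECTIONS]
    if others:
        score += SECTION_WEIGHTS["other"] * (sum(others) / len(others))
    return score


# Toy similarity scores for a single hypothetical publication.
doc_scores = {
    "abstract": 0.9,
    "results": 0.5,
    "metadata": 0.2,
    "introduction": 0.4,
    "discussion": 0.6,
}
doc_relevance = weighted_relevance(doc_scores)
print(round(doc_relevance, 2))  # 0.75
```

With this weighting, a strong abstract match dominates the ranking, which fits the legend's emphasis on abstracts as the primary evidence.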

Figure 3: Use case on knowledge synthesis and hypothesis exploration via tiered query cascade. This figure showcases a highlighted use case focusing on a chain of related questions and concepts an investigator and/or healthcare professional might pose to the RUGGED system. Queries from the user are presented to the system in numerical order, with arrows representing the inferred logical and domain-specific reasoning that connects the questions. The system retrieves the implicit and relevant information (source shown in blue) needed to respond to each query. Examples of system responses are presented in Figure 4.

Figure 4: Use case cardiovascular pathology: elucidating CVD pathogenesis. Query-response pairs between the user and the RUGGED system are shown. In the upper left panel, questions 1-6 draw on information extracted from the knowledge graph database to formulate evidence-based responses. Question 7 employs explainable graph link prediction to identify top-scoring therapeutics. The query prompts a prediction analysis, which is executed and processed automatically by the system, and key findings are succinctly summarized. Question 8 evaluates literature evidence retrieved from the defined text data corpus to verify, validate, and corroborate the predicted finding. System responses have been reviewed by a human-in-the-loop inspection process and modified for readability and brevity. A full transcript of these findings is provided in Supplementary File 1.
| Steps | Description | Time |
| Accessing Biomedical Knowledge | | 30% total |
| Prepare biomedical literature corpus | Connect to PubMed and PubMed Central, download and parse publication data for downstream tasks. | 20% |
| Prepare knowledge base data | Connect to biomedical knowledge bases, download and parse necessary information for downstream tasks. | 5% |
| Information Extraction | | 30% total |
| CaseOLAP LIFT Text Mining Analysis | Identify high level disease-protein relationships within the biomedical text corpus. | 25% |
| Knowledge Graph Construction | Connect and integrate disparate information from biomedical knowledge bases into a unified knowledge graph. | 5% |
| Prediction Analysis | | 10% total |
| Train Graph Neural Network | Train the model on the biomedical knowledge graph data to learn hidden patterns within the graph. | 5% |
| Relevance Ranking Analysis | Apply explainability module to highlight the most pertinent nodes and edges relevant to study disease. | 2.5% |
| Link Prediction | Utilize explainability module to identify key nodes and edges contributing to new predicted edges. | 2.5% |
| Hypothesis Generation and/or Validation | | 30% total |
| Database Setup for Retrieval Augmented Generation | Initialize the graph database for querying the knowledge graph and the vector database for text retrieval. | 25% |
| Hypothesis Exploration | Enable user interaction with RUGGED to access and scrutinize relevant information for hypothesis exploration. | 5% |
Table 1: Workflow and rate-limiting steps. This table provides rough estimates of the computational time required for each stage of the workflow. Rate-limiting steps include accessing, extracting, and indexing biomedical knowledge necessary for retrieval-augmented generation. Hypothesis exploration may be repeated continuously without the need to re-execute rate-limiting steps.
| Disease Category | MeSH Tree Numbers | # PMIDs | # Original Contributions | # Review Articles |
| Cardiomyopathies (CM) | C14.280.238, C14.280.434 | 132,531 | 102,337 | 19,942 |
| Cardiac Arrhythmias (ARR) | C14.280.067, C23.550.073 | 125,286 | 92,374 | 13,854 |
| Congenital Heart Defects (CHD) | C14.280.400 | 82,006 | 54,023 | 6,379 |
| Heart Valve Diseases (VD) | C14.280.484 | 72,016 | 50,119 | 5,743 |
| Myocardial Ischemia (IHD) | C14.280.647 | 256,986 | 210,042 | 30,223 |
| Cardiac Conduction System Disease (CCD) | C14.280.123 | 53,050 | 35,399 | 4,363 |
| Ventricular Outflow Obstruction (VOO) | C14.280.955 | 22,244 | 15,504 | 1,686 |
| Other Heart Diseases (OTH) | C14.280.195, C14.280.282, C14.280.383, C14.280.470, C14.280.945, C14.280.459, C14.280.720 | 114,085 | 77,302 | 11,799 |
| Total | | 635,696 | 478,404 | 69,690 |
Table 2: Biomedical literature statistics. This table details the study disease categories with their corresponding MeSH tree numbers and the number of PubMed documents retrieved through May 2024, used as the corpus for text mining. A subset of these publications, consisting of original contribution research articles and review articles, is indexed into a vector database for retrieval by RUGGED during hypothesis generation.
| Category | Number of Nodes | Number of Edges | Data Source(s) |
| Anatomy | 5,049 | 122,533 | Bgee, PubMed, MeSH, Uberon |
| Biological Process | 27,047 | 108,106 | Gene Ontology |
| Cellular Component | 4,057 | 52,238 | Gene Ontology |
| Compound | 27,278 | 3,292,028 | DrugBank, MeSH, CTD, UMLS, KEGG, TTD, SIDER, Inxight Drugs, Hetionet, PathFX, MyChem.info |
| Disease | 21,938 | 311,773 | PubMed, MeSH, DisGeNET, SIDER, ClinVar, ClinGen, PharmGKB, MyDisease.info, PathFX, UMLS, OMIM, Mondo, DOID, KEGG |
| Drug Class | 5,721 | 8,283 | ATC |
| Gene | 29,810 | 943,419 | HGNC, GRNdb, KEGG, ClinVar, ClinGen, SMPDB, DisGeNET, PharmGKB, MyGene.info |
| Molecular Function | 11,151 | 47,086 | Gene Ontology |
| Pathway | 52,012 | 234,944 | Reactome, KEGG, SMPDB |
| Protein | 20,740 | 1,074,809 | UniProt, Reactome, TTD, SMPDB, STRING, HGNC |
| Reaction | 14,647 | 128,038 | Reactome |
| Subtotal | 219,450 | 6,323,257 | |
| Text-mining Associations | 8 | 4,670 | |
| Total | 219,458 | 6,327,927 | |
Table 3: Knowledge graph statistics. This table details 11 broad biomedical categories comprising the constructed Know2BIO knowledge graph, enriched with additional edges derived from text mining analysis and predictive analysis. The resulting knowledge graph and predictions are managed by the Neo4j graph database for retrieval by RUGGED during hypothesis generation.
| Dataset | Accuracy | Precision | Recall | F1-score | AUROC | AUPRC |
| Validation | 0.7158 | 0.6639 | 0.8743 | 0.7547 | 0.8437 | 0.8637 |
| Test | 0.703 | 0.6367 | 0.9455 | 0.761 | 0.8961 | 0.9094 |
Table 4: Explainable AI model evaluation. This table reports the evaluation metrics for the knowledge graph link prediction using a two-layer graph convolutional neural network. Metrics were assessed by partitioning graph edges into 85% training, 5% validation, and 10% test datasets. Accuracy indicates the proportion of correctly classified predictions. Precision reports the proportion of correct positive predictions among all positive predictions. Recall measures the proportion of correct positive predictions among actual positive edges. The F1-score is the harmonic mean of precision and recall, balancing the two metrics. AUROC evaluates the model's ability to distinguish positive from negative edges across classification thresholds. AUPRC summarizes the trade-off between precision and recall across different thresholds. For all metrics, higher values indicate better model performance.
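As a quick consistency check on the definitions above, the F1-scores in Table 4 can be recomputed from the reported precision and recall. The sketch below uses only the values printed in the table; the helper name `f1_from_precision_recall` is hypothetical.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from Table 4.
reported = {
    "Validation": (0.6639, 0.8743, 0.7547),
    "Test": (0.6367, 0.9455, 0.761),
}

recomputed = {
    split: f1_from_precision_recall(p, r)
    for split, (p, r, _) in reported.items()
}
for split, (_, _, f1_reported) in reported.items():
    # Small differences are expected because the inputs are rounded.
    print(split, round(recomputed[split], 4), f1_reported)
```

The recomputed values agree with the reported F1-scores to within rounding, and the high recall relative to precision on both splits indicates a model that favors recovering true edges at the cost of more false positives.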
Supplementary File 1: This file details the full model response from RUGGED and a comparison against GPT-4o. Section A presents the complete human-computer interaction with RUGGED, expanding on the chain-of-query approach outlined in Figure 3 and providing the full response beyond the summary highlighted in Figure 4. Section B evaluates GPT-4o's responses without retrieval against RUGGED's, assessing attributes such as precision, depth, confidence scoring, evidence reliability, and cost.