Evidence-based Knowledge Synthesis and Hypothesis Validation: Navigating Biomedical Knowledge Bases via Explainable AI and Agentic Systems

Alexander  R. Pelletier; Joseph Ramirez; Baradwaj Simha Sankar; Irsyad Adam; Yu Yan; Dylan Steinecke; Wei Wang; Karol E. Watson; Peipei Ping

doi:10.3791/67525

Method Article

June 13th, 2025

Alexander R. Pelletier¹,², Joseph Ramirez¹, Baradwaj Simha Sankar¹, Irsyad Adam¹,⁴, Yu Yan¹,³, Dylan Steinecke¹,³, Wei Wang¹,², Karol E. Watson¹,³,⁴, Peipei Ping¹,²,³,⁴

¹Department of Physiology, UCLA School of Medicine, ²Scalable Analytics Institute (ScAi) at Department of Computer Science, UCLA School of Engineering, ³Medical Informatics, University of California at Los Angeles (UCLA), ⁴Department of Medicine (Cardiology), UCLA School of Medicine

Summary

This article describes RUGGED (Retrieval Under Graph-Guided Explainable disease Distinction), which integrates Large Language Model (LLM) inference with Retrieval-Augmented Generation (RAG). It draws evidence from expert-curated biomedical knowledge bases and peer-reviewed biomedical publications to synthesize new knowledge from up-to-date information, identify explainable and actionable predictions, and pinpoint promising directions for hypothesis-driven investigations.

Abstract

The scale of biomedical knowledge, spanning scientific literature and curated knowledge bases, poses a significant challenge for investigators in processing, evaluating, and interpreting findings effectively. Large Language Models (LLMs) have emerged as powerful tools for navigating this complex knowledge landscape but may produce hallucinatory responses. Retrieval-Augmented Generation (RAG) is essential for identifying relevant information to enhance accuracy and reliability. This protocol introduces RUGGED (Retrieval Under Graph-Guided Explainable disease Distinction), a comprehensive workflow designed to support knowledge integration, to mitigate bias, and to explore and validate new research directions. Biomedical information from publications and knowledge bases is synthesized and analyzed through text-mining association analysis and explainable graph prediction models to uncover potential drug-disease relationships. These findings, along with the source text corpus and knowledge bases, are incorporated into a framework that employs RAG-enhanced LLMs to enable users to explore hypotheses and investigate underlying mechanisms. A clinical use case demonstrates RUGGED's capability in evaluating and recommending therapeutics for Arrhythmogenic Cardiomyopathy (ACM) and Dilated Cardiomyopathy (DCM), analyzing prescribed drugs for molecular interactions and potential new applications. The platform reduces LLM hallucinations, highlights actionable insights, and streamlines the investigation of novel therapeutics.

Introduction

The hypothesis exploration process in the biomedical enterprise is essential to uncover novel molecule-drug-disease interdependencies underlying pathogenesis and to unlock therapeutic potential¹,². This process draws evidence from existing biomedical knowledge, synthesizing new findings based on logical leads embedded within peer-reviewed literature (e.g., >36 million reports from PubMed), and integrating high-confidence curated evidence rooted among biomedical knowledge bases. Recent advancements reduce laborious manual effort by applying text mining on literature corpora³,⁴,⁵ as well as employing graph-based analyses⁶,⁷,⁸,⁹ to synthesize relevant information and uncover new avenues for investigation. Despite these efforts, current approaches often do not support deep contextual understanding due to fragmented data. Furthermore, they lack the ability to draw evidence-based inferences and to interactively explore new hypotheses.

Recent developments in Large Language Models (LLMs) shed new light on these challenges, demonstrating high-level contextual understanding by training on vast amounts of information across multiple disciplines¹⁰,¹¹,¹². In the biomedical domain, LLMs have shown a promising role in extracting patient information¹³ and general clinical question answering¹⁴,¹⁵, whereas applications in domain-specific question answering¹⁶ and utilities in primary clinical care¹⁷ remain to be explored. These models exhibit the ability to reason and draw inferences from complex datasets, rendering them potentially suited for conducting hypothesis exploration and knowledge synthesis. Furthermore, some models feature chat-like interaction to engage users and enable dynamic exploration of topics, surpassing the conventional boundaries of query-based search engines and knowledge bases¹⁸,¹⁹.

In addition to these potentials, LLMs face significant challenges, such as possible hallucination of information, displaying unwarranted confidence in potentially inaccurate explanations, lacking interpretability, and being susceptible to biased or inappropriate content²⁰,²¹,²²,²³,²⁴. Applied directly to guiding clinical decision-making, the LLM-derived responses and predictions have high stakes; any errors may potentially result in costly laboratory experiments or negatively affect patient health trajectories²⁵,²⁶. Thus, reliable and trustworthy LLM responses are paramount, as their advice must be firmly rooted in evidence. In these scenarios, interpretability is not a luxury but a necessity for understanding why these models make the predictions they do.

To this end, Retrieval-Augmented Generation (RAG) is a system designed to minimize LLM hallucinations, grounding LLM responses in evidence to enhance their accuracy and reliability²⁷,²⁸. This approach typically involves retrieval of relevant text passages, such as integrating an LLM (e.g., ChatGPT) with PubMed, allowing for the identification of relevant citations to user queries²⁹,³⁰. Not limited to text, retrieval on Knowledge Graphs (KGs) shows promise in application to LLMs for tasks such as fact-checking³¹,³²,³³, transparent reasoning³⁴,³⁵,³⁶, encoding knowledge³⁷, improving question answering³⁸, and completing knowledge graphs³⁹. By encoding factual information from verified sources, KGs enhance the accuracy, transparency, and reliability of LLM responses. Link prediction techniques within these graphs leverage deep learning to identify previously hidden relationships among molecules, drugs, and diseases⁵,⁴⁰,⁴¹. Recent advancements in explainable AI predictions further enhance the transparency and interpretability of these link prediction tasks, lending potential support to interpret biomedical hypotheses as a viable avenue for investigation⁴²,⁴³,⁴⁴. These advancements ensure that LLM-generated responses are balanced and drawn from the evidence, significantly boosting their applicability in the biomedical enterprise.

This protocol presents RUGGED (Retrieval Under Graph-Guided Explainable disease Distinction) as an accessible and efficient workflow for the exploration and validation of clinical therapeutic insights (Figure 1). This workflow protocol leverages the vast resources of biomedical literature and knowledge bases for the extraction and validation of relevant information, enabling query-tailored retrieval processes (Figure 2). An explainable artificial intelligence prediction model is employed to uncover interpretable and actionable insights from the existing biomedical knowledge, thereby enhancing the transparency and utility of predictive models. The completed workflow streamlines the exploration of knowledge graphs and model predictions via RAG-enabled LLMs, facilitating intuitive and informed interactions for investigators, clinicians, and clinical professionals.

This section lays the groundwork for the protocol, with steps to implement this approach described in the following section. Next, a translational clinical use case is showcased to demonstrate this approach, applied to the evaluation of drugs for molecular interactions as well as therapeutic strategies for cardiovascular medicine. Finally, the implications of this protocol are discussed.

Protocol

This protocol has been developed in Python 3.10 and implemented as a Docker container in Windows. The commands provided are based on the Unix environment within the Docker container. The software is available at https://github.com/pinglab-utils/RUGGED. Table 1 presents an estimate of computational time for all steps in the protocol.

1. Installing the software

Install the prerequisite software following the instructions in the Table of Materials.
NOTE: This protocol requires version control, containerization, a graph database, and large language model (LLM) service(s). Version control and containerization are optional but can simplify the setup process; graph database and LLM services may be substituted with similar tools if the user is technically proficient.
1. Configure Inter-container Networking. Configure Docker containers to be connected to other services on the device (e.g., other Docker containers). Type the following command into the terminal: docker network create rugged_network
Set up Large Language Models (LLMs) services. Choose the appropriate LLM service for the use case, among commercial LLM services or services from a local model running on the user's device. Ensure a minimum of one LLM service is specified, though agents can be mixed and matched to leverage different models.
1. Start Local LLM service. If running Ollama through its Graphical User Interface (GUI), run the GUI executable (e.g., ollama.exe). If using Docker, run: `docker run --name ollama --net rugged_network -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama`. If using Docker with GPU acceleration, ensure the GPU driver is installed and run: `docker run --name ollama --net rugged_network -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 ollama/ollama`.
2. Initialize local LLM model. Determine which model to use among the supported models (recommended: llama3, mistral, mixtral). If using Docker, type `docker exec -it ollama ollama run <MODEL_NAME>` in the command line; if using the Ollama GUI, type `ollama run <MODEL_NAME>`, replacing <MODEL_NAME> with the model name in each case.
Start Graph Database service. Select a graph database service among Docker container, desktop application, or online web service. Follow the installation instructions in the Supplementary Materials to complete the setup.
Set up the RUGGED environment. Verify the downloaded Docker images by typing `docker images`. Ensure all Docker images from the previous step are listed. Run these commands in the terminal to download the RUGGED Docker image and code:
docker pull pinglabutils/rugged:latest
git clone https://github.com/pinglab-utils/RUGGED
1. Configure Commercial LLM service. If using commercial LLM services, ensure the account and associated API key have sufficient funds. Modify RUGGED configuration files by editing the configuration file at `RUGGED/config/openai_key.txt` and adding the API key to the file.
2. Configure commercial agents. Determine which LLM agents within RUGGED's system will use this service. Modify the configuration file at `RUGGED/config/llm_agents.json` and update the agent fields to specify the model version. Recommended models: gpt-3.5-turbo, gpt-4o.
3. Configure Local LLM service. If using a different service endpoint than the default endpoint for Ollama at `http://localhost:11434`, modify and update the `OLLAMA_URI` field within the configuration files at `RUGGED/config/ollama_config.json`.
4. Configure Local LLM agents. Determine which LLM agents within RUGGED's system will use this service. Modify the configuration file at `RUGGED/config/llm_agents.json` and update the agent fields to specify `ollama` as the selected model.
5. Configure the graph database endpoint. If modified from the default password and username for Neo4j, edit the `RUGGED/config/neo4j_config.json` configuration file to update the `uri`, `username`, and `password` fields.
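
The configuration edits in the sub-steps above can also be scripted. The following is a minimal sketch, assuming each configuration file is a flat JSON object. Only the field names `OLLAMA_URI`, `uri`, `username`, and `password` come from the protocol; the helper `update_json_config`, the example values, and the temporary directory standing in for `RUGGED/config` are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def update_json_config(path, updates):
    """Merge `updates` into the JSON object stored at `path` and write it back."""
    path = Path(path)
    # Start from the existing settings when the file is already present.
    config = json.loads(path.read_text()) if path.exists() else {}
    config.update(updates)
    path.write_text(json.dumps(config, indent=2))
    return config


# Throwaway directory standing in for RUGGED/config (assumption for the demo).
config_dir = Path(tempfile.mkdtemp())

ollama_config = update_json_config(
    config_dir / "ollama_config.json",
    {"OLLAMA_URI": "http://localhost:11434"},  # default Ollama endpoint
)
neo4j_config = update_json_config(
    config_dir / "neo4j_config.json",
    {
        "uri": "bolt://localhost:7687",  # assumed Bolt address on the default port
        "username": "neo4j",             # placeholder credentials
        "password": "<PASSWORD>",
    },
)
print(json.dumps(neo4j_config, indent=2))
```

When the services run as containers on a shared Docker network, the host part of each address may need to be the container name rather than `localhost`; the actual file layout should be checked against the repository before relying on this sketch.
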
Start the RUGGED service by running the command:
docker run --name rugged -it --net rugged_network --gpus=all -v <PATH_TO_FOLDER>\RUGGED\:/data ping-lab-utils:RUGGED /bin/bash
NOTE: To verify the services are working as expected, navigate to the RUGGED directory and execute steps 1.4.1. through step 1.4.4. in this terminal window.
1. Verify LLM service functionality. Navigate to the test folder in the RUGGED directory and execute the following commands to verify that OpenAI and/or Ollama services are functioning:
  python test_openai.py
  python test_ollama.py
2. Verify named entity recognition service functionality. Execute `test_ner.py` to verify the code for Named Entity Recognition of user queries is functioning properly.
3. Verify Neo4j service functionality. Execute test scripts to verify the Neo4j service is functioning as expected by typing `python test_neo4j.py`
4. (Optional) Verify HTTP access to graph database. Open a web browser and visit the Neo4j user interface.
  NOTE: For Neo4j in Docker or Desktop, the default URL is `http://localhost:7474`. For Neo4j AuraDB, use the link provided during setup.
(Optional) Troubleshoot issues. Verify the services supporting RUGGED during software setup to anticipate issues. Troubleshoot any unsuccessful tests from step 1.4 by following the error messages reported by the test scripts, which describe the issues.
1. Verify Docker Containers. Confirm all Docker containers are running by using `docker ps` in the terminal, including the RUGGED docker container, Neo4j docker container (optional), and Ollama docker container (optional).
2. Verify Networking ports. For Docker services, ensure the correct ports are open and check logs with `docker logs neo4j` or `docker logs ollama`.
  NOTE: By default, Neo4j uses ports 7474 for http and 7687 for its bolt interface; Ollama uses port 11434.
3. Verify service applications. For applications installed directly on the device (e.g., Ollama and Neo4j Desktop), open the applications to confirm they are running.
4. Verify web services. For Neo4j AuraDB, log into the website and verify the service is running.
5. Verify firewall rules. Modify device firewall rules to ensure the firewall is not blocking any external services.
6. Restart device. If issues are not resolved, restart the device and retry from step 1.5.1.
7. Open an issue. If problems persist, please open an issue on the RUGGED GitHub (https://github.com/pinglab-utils/RUGGED).
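
The port checks in the troubleshooting steps above can be approximated with a short script. This is a minimal sketch using only the Python standard library; the port numbers are the defaults named in the protocol, whereas the host `localhost` and the helper names `port_is_open` and `check_services` are assumptions for illustration.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Default ports from the protocol; the host is an assumption.
SERVICES = {
    "Neo4j HTTP": ("localhost", 7474),
    "Neo4j Bolt": ("localhost", 7687),
    "Ollama": ("localhost", 11434),
}


def check_services(services=SERVICES):
    """Map each service name to whether its port accepts connections."""
    return {name: port_is_open(host, port) for name, (host, port) in services.items()}


if __name__ == "__main__":
    for name, reachable in check_services().items():
        print(f"{name}: {'reachable' if reachable else 'NOT reachable'}")
```

An open port only shows that something is listening; the test scripts in the RUGGED repository remain the way to confirm that each service actually responds as expected.
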

2. Accessing biomedical knowledge and extracting information

NOTE: These steps outline two knowledge extraction pipelines as the underlying information constituting the Retrieval Augmented Generation (RAG) system of RUGGED: (1) the CaseOLAP LIFT biomedical text mining pipeline⁵ and (2) the Know2BIO knowledge graph construction workflow⁹. To use RUGGED with custom data, proceed to step 4.

Extract Biomedical Literature. Identify relevant documents and high-level protein-disease relationships using CaseOLAP LIFT, a computational protocol designed to investigate sub-cellular proteins and their associations with disease through biomedical literature text mining. Complete this step to prepare the necessary information to inform the RAG workflow with targeted insights from these reports.
1. Run CaseOLAP LIFT Text Mining Analysis. Visit the CaseOLAP LIFT JoVE Protocol (steps 4-5 are not necessary for this analysis).
2. Move processed text documents. Ensure the parsed biomedical documents (pubmed.json) and their full text (pmid2full_text_sections.json) from step 3 are in the CaseOLAP LIFT data folder. Move these files into the RUGGED data folder using the below commands:
  mv <PATH_TO_FOLDER>/caseolap_lift/caseolap_lift_shared_folder/data/pubmed.json <PATH_TO_FOLDER>/RUGGED/data/text_corpus
  mv <PATH_TO_FOLDER>/caseolap_lift/caseolap_lift_shared_folder/data/pmid2full_text_sections.json <PATH_TO_FOLDER>/RUGGED/data/text_corpus
3. Move text mining results. Verify the knowledge graph file (merged_edge_list.tsv) with protein-disease associations was generated in the result/kg folder. Check the number of associations is as expected, dependent on the selected settings from steps 1-3 (see Table 2 for example). Move this file to the data folder of RUGGED:
  mv <PATH_TO_FOLDER>/caseolap_lift/caseolap_lift_shared_folder/result/graph_data/merged_edge_list.tsv <PATH_TO_FOLDER>/RUGGED/data/knowledge_graph
Extract biomedical knowledge. Assemble a biomedical knowledge graph using the Know2BIO software, which integrates data from 30 biomedical knowledge bases. Complete this step to ensure the information for these biomedical relationships and multi-modal data are processed to support the downstream RAG workflow.
1. Clone Know2BIO repository. Clone the repository using the command below, then navigate to the Know2BIO repository.
  git clone https://github.com/Yijia-Xiao/Know2BIO.git
2. Prepare data and licenses. Navigate to the dataset folder and follow the instructions in the `README.md` file. Create the user accounts needed to access various online resources (e.g., UMLS thesaurus, DrugBank).
3. Download knowledge base resources. Execute the `python create_edge_files.py` script and monitor the progress of the knowledge graph extraction pipeline. Ensure the .csv file in the `Know2BIO/dataset/output` folder representing biomedical relationships was generated.
4. Construct knowledge graph. Execute the `python prepare_kgs.py` script to automatically combine the relationships extracted in the previous step into a unified knowledge graph, formatting the graph by data source and domain.
5. Verify the output. Check that the completed knowledge graph is present as the `whole_kg.txt` file in the `Know2BIO/dataset/know2bio_dataset` directory. Confirm the number of edges in the file is as expected; see Table 3 for an example, which resulted in over 6 million edges. Proceed to the following step, as the remaining steps in the Know2BIO README are not required for this analysis.
  NOTE: The relationships from Know2BIO in Table 3 were drawn from 31 sources, including ATC (World Health Organization), Bgee⁴⁵, CTD⁴⁶, ClinGen⁴⁷, ClinVar⁴⁸, DOID⁴⁹, DisGeNET⁵⁰, DrugBank⁵¹, GRNdb⁵², Gene Ontology⁵³, HGNC⁵⁴, Hetionet³, Inxight Drugs⁵⁵, KEGG⁵⁶, MeSH⁵⁷, Mondo⁵⁸, MyChem.info⁵⁹, MyDisease.info⁵⁹, MyGene.info⁵⁹, OMIM⁶⁰, PathFX⁶¹, PharmGKB⁶², PubMed, Reactome⁶³, SIDER⁶⁴, SMPDB⁶⁵, STRING⁶⁶, TTD⁶⁷, UMLS⁶⁸, Uberon⁶⁹, and UniProt⁷⁰.
6. Move knowledge graph results. Move the file into the `/data/knowledge_graph/` folder of the RUGGED directory.
  mv <PATH_TO_FOLDER>/Know2BIO/dataset/know2bio/whole_kg.txt <PATH_TO_FOLDER>/RUGGED/data/knowledge_graph
Construct a combined knowledge graph. Integrate the graph from the previous step with the high-level protein-disease relationships from text mining from step 2.1 into a single unified knowledge graph.
1. Verify Results in the RUGGED directory. Verify the knowledge graph construction result file (whole_kg.txt) and the text mining relationship results (merged_edge_list.tsv) are in the knowledge_graph directory within the data folder.
2. Integrate the Results. Execute `combine_kg_results.py` script to merge the extracted relationships and entities from the text mining analysis and knowledge graph construction into a single cohesive knowledge graph. Follow the example command below:
  python rugged/knowledge_graph/combine_kg_results.py ./data/knowledge_graph/merged_edge_list.tsv ./data/knowledge_graph/whole_kg.txt --output_dir ./data/rugged_knowledge_graph
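
Conceptually, the integration step unions two edge lists and removes duplicates. The following is a minimal sketch of that idea, not the logic of `combine_kg_results.py` itself. It assumes both inputs are tab-separated with head, relation, and tail in the first three columns; the relation names and the `UniProt:P00001` and `DrugBank:DB00001` identifiers are made-up placeholders.

```python
import csv
import io


def read_edges(handle):
    """Yield (head, relation, tail) triples from a tab-separated stream."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 3:  # skip blank or malformed lines
            yield tuple(field.strip() for field in row[:3])


def merge_edge_lists(*handles):
    """Combine several edge streams, dropping exact duplicate triples."""
    seen = set()
    merged = []
    for handle in handles:
        for edge in read_edges(handle):
            if edge not in seen:
                seen.add(edge)
                merged.append(edge)
    return merged


# Toy inputs standing in for merged_edge_list.tsv and whole_kg.txt.
text_mining = io.StringIO(
    "MeSH_Disease:D002311\tassociated_with\tUniProt:P00001\n"
)
knowledge_base = io.StringIO(
    "DrugBank:DB00001\ttreats\tMeSH_Disease:D002311\n"
    "MeSH_Disease:D002311\tassociated_with\tUniProt:P00001\n"
)

edges = merge_edge_lists(text_mining, knowledge_base)
nodes = {node for head, _, tail in edges for node in (head, tail)}
print(len(edges), "edges;", len(nodes), "nodes")
```

Counting edges and nodes after merging gives a quick sanity check against the totals expected from the two source files.
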
Filter knowledge graph. (Optional) Sample a subset of the knowledge graph which will be used for the predictive analysis. This step retains only closely related relationships and reduces the computational resources necessary to execute the deep learning predictions.
1. Identify relevant nodes. Determine the biomedical entities of interest for the predictive analysis in step 3 by reviewing the knowledge graph and pinpointing relevant nodes.
  NOTE: This protocol focuses on disease nodes for Arrhythmogenic Cardiomyopathy (ACM) and Dilated Cardiomyopathy (DCM), as MeSH_Disease: D019571 and MeSH_Disease: D002311, respectively. Target nodes need to be tailored to the intended use case.
2. Sample from the knowledge graph. Use the `kg_filter.py` script to extract the subgraph reachable within k hops from the selected nodes of interest. Follow the example command below, which filters the graph reachable within 2 hops from the selected disease nodes:
  python ./rugged/knowledge_graph/kg_filter.py --k 2 --disease "MeSH_Disease:D019571,MeSH_Disease:D002311" --input_file ./data/rugged_knowledge_graph/rugged_knowledge_graph_edges.csv --output_dir ./data/rugged_knowledge_graph/filtered_kg/
  NOTE: Increasing the k-hop value (--k) expands the data scope within the graph for prediction analysis but also demands greater computational resources.
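
The k-hop filtering described above amounts to a bounded breadth-first search from the selected nodes. The following is a minimal sketch under stated assumptions: edges are treated as undirected for the purpose of measuring hops, and an edge is kept only when both endpoints fall within k hops. The function name `k_hop_subgraph` and the toy node labels are hypothetical; the filtering script in the repository may differ in either respect.

```python
from collections import defaultdict, deque


def k_hop_subgraph(edges, seeds, k):
    """Return the edges whose endpoints both lie within k hops of any seed node."""
    # Treat the graph as undirected when measuring hop distance.
    neighbors = defaultdict(set)
    for head, _, tail in edges:
        neighbors[head].add(tail)
        neighbors[tail].add(head)

    # Breadth-first search records the hop distance of each reachable node.
    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if distance[node] == k:
            continue  # do not expand beyond k hops
        for nxt in neighbors[node]:
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)

    return [edge for edge in edges if edge[0] in distance and edge[2] in distance]


# Toy chain: disease - A - B - C - E
toy_edges = [
    ("disease", "rel", "A"),
    ("A", "rel", "B"),
    ("B", "rel", "C"),
    ("C", "rel", "E"),
]
filtered = k_hop_subgraph(toy_edges, ["disease"], k=2)
print(filtered)
```

Because every additional hop can pull in the full neighborhood of each frontier node, the retained subgraph can grow quickly with k, which is consistent with the resource warning in the NOTE above.
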

3. Explainable prediction analysis

NOTE: Execute GNNExplainer⁴⁴ on a Graph Convolutional Network model to predict potential edges (relationships) in the knowledge graph and provide insights into previously unknown associations.

Ensure the RUGGED Docker container is running. If the previous terminal window was closed, connect to the Docker container with the command `docker exec -it rugged /bin/bash`. Once connected to the Docker container, navigate to the RUGGED directory.
Determine the edge(s) to predict. Provide the edges as pairs of nodes in a .txt file (e.g., edges_to_predict.txt). The edges already existing in the knowledge graph will be filtered out from the predictions.
Run the prediction analysis script. Specify the edges to predict and the input knowledge graph as command line arguments for the prediction. Key arguments: -p (path to edges file), -i (input knowledge graph), -o (output directory), -n (top predictions, e.g., 5), -k (top edges to visualize, e.g., 10). Example command:
python rugged/predictive_analysis/generate_explainable_prediction.py -o output -n 5 -k 10 -p ./output/edges_to_predict.txt -i ./data/rugged_knowledge_graph/filtered_kg/filtered_k2_edges.csv
Evaluate model performance. Examine the terminal output or the `output.log` file generated from the previous step to assess model performance based on splitting the filtered knowledge graph into training, validation, and test sets with an 85:5:10 ratio. Adjust the model arguments if performance is not as expected, using Table 4 as an example.
Verify that the results are in the output folder. Examine the model results in `prediction_results.csv` and review the top n predictions within the output folder. For each prediction, a graph visualization illustrates the most pertinent edges contributing to the prediction and their relative importance scores.
Move predictive analysis results. Once satisfied with the predictive analysis results, move the results into the `data/predictions/` folder of the RUGGED directory.
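
Two bookkeeping operations from the steps above, the 85:5:10 edge split and the removal of candidate edges already present in the graph, can be illustrated with the standard library alone. This is a minimal sketch, not the code used by `generate_explainable_prediction.py`; the function names, the fixed random seed, and the choice to treat node pairs as unordered are assumptions.

```python
import random


def split_edges(edges, ratios=(0.85, 0.05, 0.10), seed=0):
    """Shuffle edges and split them into train, validation, and test lists."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(edges)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * ratios[0])
    n_val = round(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


def drop_known_edges(candidates, known_edges):
    """Remove candidate node pairs that already exist in the graph (either direction)."""
    known = {frozenset(pair) for pair in known_edges}
    return [pair for pair in candidates if frozenset(pair) not in known]


toy_edges = [(i, i + 1) for i in range(100)]
train, val, test = split_edges(toy_edges)
print(len(train), len(val), len(test))
```

Fixing the seed keeps the split reproducible between runs, which helps when comparing model settings such as those in Table 4.
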

4. Hypothesis generation

Connect to the RUGGED Docker Container.
1. Ensure the RUGGED Docker container is running. If the previous terminal window was closed, connect to the Docker container.
2. Navigate to the RUGGED directory. Once connected, type cd /workspace/RUGGED to navigate to the directory. Issue the remaining steps in this command line window.
3. Verify the supporting services are running. If using Ollama and Neo4j in Docker, ensure the containers are running by typing `docker ps`. Repeat step 1.7 to verify services are functioning properly and step 1.4 to troubleshoot issues if they exist.
Prepare RAG data. Prepare the knowledge graph and text corpus for retrieval.
NOTE: These data may be substituted with user-defined data by placing the data into the `data/knowledge_graph/` and `data/text_corpus/` directories, respectively. These data must follow the format from the GitHub repository (https://github.com/pinglab-utils/RUGGED/tree/main/data).
1. Verify the resources. Ensure the text corpus is in the `data/text_corpus/` directory, the knowledge graph with text mining predictions file is in the data/knowledge_graph/ directory, and the prediction results are in data/predictions/ directory (from steps 2.1.2., 2.3.2., and 3.5. respectively).
2. Populate the graph database. Execute the command `python ./neo4j/prepare_neo4j.py` to create the necessary nodes, edges, and node features.
3. Index the Text Corpus. Execute the command `python ./text/prepare_corpus.py` to index the text corpus and enable RUGGED to retrieve relevant text documents based on user queries by chunking the documents into sections of 500 tokens to create a vector database using BART⁷¹.
4. (Optional) Test the graph database retrieval. Send a test query to the Neo4j database to ensure it is populated correctly and can return the expected results. Verify that the output matches the expected nodes and relationships in the database. Example command:
  python ./test/test_neo4j_retrieval.py --query "MATCH (n) RETURN n LIMIT 5"
5. (Optional) Test RAG corpus retrieval. Send a test query to the RAG text corpus to ensure the text retrieval system is working. Check that the retrieved documents are relevant to the query and that the embeddings are functioning as expected. Example command: python ./test/test_literature_retrieval.py --query "Which documents are related to using beta-blockers to treat cardiovascular disease?"
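
The indexing step above splits each document into sections of 500 tokens before embedding. The following is a minimal sketch of the chunking alone, assuming whitespace-separated words as a stand-in for the tokens of the embedding model; the function name `chunk_tokens` is hypothetical, and a model tokenizer would generally produce different boundaries.

```python
def chunk_tokens(text, chunk_size=500):
    """Split text into consecutive chunks of at most `chunk_size` whitespace tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    tokens = text.split()
    return [
        " ".join(tokens[start:start + chunk_size])
        for start in range(0, len(tokens), chunk_size)
    ]


# Toy document of 1,200 tokens.
document = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_tokens(document)
print([len(chunk.split()) for chunk in chunks])
```

Each chunk would then be embedded and stored in the vector database so that a query embedding can be compared against it.
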
Interact with RUGGED. Start RUGGED in the command line interface to interact with the system. Execute the command `python rugged.py`. Query the system to retrieve relevant information using specific commands to interact with the knowledge graph and text corpus.
1. Query the knowledge graph. Extract specific information from the knowledge graph by posing the question in natural language, starting with the keyword "query". For example:
  query "What are the currently prescribed drugs classified as beta blockers, antiarrhythmic drugs, and antifibrotic drugs?"
2. Explore the Predictions. Explore link prediction analyses from step 3, and ask to search for a specific relationship, leading with the keyword "predict". For example:
  predict, "Which of these drugs could potentially be used to treat ACM and/or DCM that is not currently known?"
3. Explore literature retrieval. Explore documents related to a specific biomedical topic from step 2. Pose the question in natural language, leading with the keyword "search". For example:
  search, "What literature evidence supports the claim that these predicted drugs could be used to treat ACM and/or DCM?"
4. Iterate and refine query. Respond directly in the command line to iterate and refine inquiries using RUGGED's chat-like interface. Refer to previous user-system conversations to revise and refine questioning and queries.
5. Rerun Cypher Commands in Neo4j. (Optional) Refine the knowledge graph query results by adjusting the provided Cypher command used to retrieve the information. Rerun or modify this command by visiting the Neo4j browser interface from Step 1.4.4 (e.g., at http://localhost:7474). Paste and modify the Cypher commands as needed to refine queries and gather more specific insights.
6. Summarize Conversation. Review the retrieved information and summarize the conversation with RUGGED. Type the keyword "summarize" to output a summary of the interaction to a text file for later analysis. The full-text response will be displayed in the terminal.
7. Conduct a human-in-the-loop review to enhance the accuracy of the output by inspecting and modifying the system responses for readability and brevity before finalizing the summary.
8. Review Chat Logs. Inspect the full text of the interaction in the log folder in RUGGED. Retain these intermediate commands and conversations between LLM agents within RUGGED for troubleshooting and reproducibility.
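
The keyword-led commands above ("query", "predict", "search", "summarize") suggest a simple dispatch on the first word of each input line. The following is a minimal sketch of such a parser, written only to illustrate the interaction pattern; it is an assumption about how the input could be interpreted, not the parsing logic of `rugged.py`.

```python
KEYWORDS = ("query", "predict", "search", "summarize")


def parse_command(line):
    """Split an input line into (keyword, question); keyword is None for free-form chat."""
    stripped = line.strip()
    first, _, rest = stripped.partition(" ")
    keyword = first.rstrip(",").lower()
    if keyword in KEYWORDS:
        # Drop surrounding quotes so the question text is passed on cleanly.
        return keyword, rest.strip().strip('"')
    return None, stripped


for example in (
    'query "Which drugs are classified as beta blockers?"',
    'search, "What evidence supports these predictions?"',
    "summarize",
    "Can you expand on the second drug?",
):
    print(parse_command(example))
```

Lines without a leading keyword are returned unchanged, matching the chat-like follow-up questions described in the refinement step.
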
Shutting down and restarting RUGGED.
1. Get Docker Container IDs. Use the command `docker ps` to list all running containers and obtain the container IDs for RUGGED, Neo4j, and Ollama. For all following commands, replace <container_id_for_rugged>, <container_id_for_neo4j>, and <container_id_for_ollama> with the actual container IDs.
2. Stop Docker Containers. Shut down RUGGED and the associated Docker containers using their container IDs.
  docker stop <container_id_for_rugged>
  docker stop <container_id_for_neo4j>
  docker stop <container_id_for_ollama>
  NOTE: Stopping these containers before shutting down the device is recommended to prevent potential data loss and ensure all processes close properly.
3. Restart Docker Containers. To restart the RUGGED system, use the container IDs to start the necessary Docker containers.
  docker start <container_id_for_rugged>
  docker start <container_id_for_neo4j>
  docker start <container_id_for_ollama>
4. Re-attach to Docker Network. If needed, use these commands to re-attach the containers to the network.
  docker network connect rugged_network <container_id_for_rugged>
  docker network connect rugged_network <container_id_for_neo4j>
  docker network connect rugged_network <container_id_for_ollama>
5. Verify service functionality. Upon restart, repeat steps 1.4-1.5 to ensure the software is working as expected.

Results

These representative results were obtained by following the procedure outlined in this protocol. A text mining association analysis was performed following the CaseOLAP LIFT protocol⁵ with default parameters, studying eight broad categories of cardiovascular diseases⁷² and their association with mitochondrial proteins (GO:0005739). In total, 635,696 reports through May 2024 were determined as relevant to these diseases; among them, 4,655 high-confidence protein-disease associations were identified to inform downstream analyses. A biomedical knowledge graph was constructed using the software code from Know2BIO using default settings in May 2024⁹. The resulting knowledge graph consists of 219,450 nodes, 6,323,257 edges, as well as node features for 189,493 nodes with node descriptions, protein/gene sequences, chemical structure, etc. where available. An estimate of computational time for all steps in the protocol is presented in Table 1.

The RUGGED system was initialized by constructing the vector databases for both knowledge graph nodes and features as well as the CVD-relevant publications. All knowledge graph nodes, edges, and node features were processed with a chunk size of 20 tokens with the BART⁷¹ embedding model to prepare for RAG vector search. Similarly, original contributions and review articles were processed using a chunk size of 500 tokens and the BART embedding model to prepare for the RAG vector search. For literature retrieval, full-text publications greater than 500 tokens were hierarchically summarized based on the individual sections of a publication by the BART embedding model. The GPT-4o model was used for the remaining LLM agents in the system.

These representative results showcase an example use case to investigate potential drug therapeutics for Arrhythmogenic Cardiomyopathy (ACM) and Dilated Cardiomyopathy (DCM), identified as MeSH_Disease: D019571 and MeSH_Disease: D002311, respectively. A series of inquiries is outlined in Figure 3, with highlighted examples of model responses shown in Figure 4, and full response reported in Supplementary File 1, Section A. The direction of inquiry was adapted to the investigator-validated responses, crafting subsequent queries based on the results of the previous responses. The analysis revealed 11 drug candidates classified under beta blockers and antiarrhythmics. Novel avenues for therapeutic treatment were assessed using a Graph Convolutional Neural Network link prediction model on a subset of the complete knowledge graph, including nodes within 1-hop from study disease and drug nodes and their interconnections, with evaluation metrics reported in Table 4. The top 10 relevant edges for each prediction by the model were further examined by a graph explainability module, GNNExplainer⁴⁴, to identify the top nodes and edges contributing to each prediction, respectively. The total cost of using commercial LLM for all steps of the RUGGED protocol for this use case is estimated at $1.50 at the time of writing.

Biomedical knowledge graph diagram showing text mining and prediction analysis processes.
Figure 1: Retrieval Under Graph-Guided Explainable disease Distinction (RUGGED) workflow. RUGGED consists of five primary components: (1) assembling and processing data from ethically sourced and professionally managed resources (e.g., PubMed and curated biomedical knowledge bases), (2) integrating peer-reviewed research findings into a unified knowledge graph, (3) structuring the text and graph data within database services, (4) modeling and predicting explainable relationships among biomedical entities within the knowledge graph, and (5) retrieving and synthesizing knowledge through a Retrieval Augmented Generation (RAG) workflow (Figure 2) to validate complex molecular relationships and explore AI-driven disease predictions. A human-in-the-loop review step can be conducted by the user to enhance the accuracy of the output.

Biomedical text and knowledge graph retrieval, similarity search, diagram of AI-based process.
Figure 2: Retrieval architecture and bias mitigation workflow. The Retrieval Augmented Generation (RAG) framework employs multiple LLM agents, each executing specific tasks to support access to relevant information based on the user query. This system provides documented evidence for the user-facing GPT-based Reasoning Agent, facilitating user-agent interaction and synthesis of knowledge. (1) Biomedical Text Retrieval: Peer-reviewed original contributions and review articles are filtered based on their relevance to understanding disease associations. A vector database is constructed for author- and editor-validated text evidence, weighted by the corresponding section of the publication: 70% Abstract, 10% Results, 10% Metadata, and 10% for all other subsections. A keyword search and a similarity search against the text embedding of the user query together identify relevant documents. Summaries of each document are generated using a BERT-based summarizer, with the GPT-based Text Evaluator Agent refining the search to validate query-document relevance. (2) Knowledge Graph Retrieval: A BERT-based named entity recognition and GPT-based relation extraction module connects the user query to relevant entities in the knowledge graph. A similarity search in a vector database identifies pertinent nodes and edges. Data is retrieved from the Neo4j database via Cypher queries generated by the GPT-based Cypher Query Agent and refined by the Query Verification Agent. (3) The individual responses from the Biomedical Text Retrieval or Knowledge Graph Retrieval pipelines are presented to the Reasoning Agent, which synthesizes a concise response with minimal bias to the user's query. This system is guided to maintain accuracy and impartiality in presenting factual information.
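
The section weighting in the Figure 2 caption can be made concrete with a short sketch. How the weights enter the retrieval score is not specified in the caption, so the code below assumes a weighted sum of per-section cosine similarities; the function names and the toy vectors are hypothetical. Only the 70/10/10/10 weights come from the caption.

```python
import math
from typing import Dict, Sequence

# Section weights from the Figure 2 caption.
SECTION_WEIGHTS: Dict[str, float] = {
    "abstract": 0.70,
    "results": 0.10,
    "metadata": 0.10,
    "other": 0.10,
}


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def weighted_document_score(query_vec: Sequence[float],
                            section_vecs: Dict[str, Sequence[float]]) -> float:
    """Assumed scoring rule: weighted sum of per-section similarities.

    Sections missing from section_vecs contribute nothing to the score.
    """
    return sum(
        SECTION_WEIGHTS[name] * cosine(query_vec, vec)
        for name, vec in section_vecs.items()
        if name in SECTION_WEIGHTS
    )


# Toy two-dimensional embeddings, for illustration only.
query = [1.0, 0.0]
document = {
    "abstract": [1.0, 0.0],
    "results": [0.0, 1.0],
    "metadata": [1.0, 1.0],
    "other": [1.0, 0.0],
}
score = weighted_document_score(query, document)
print(round(score, 4))
```

Under this assumed rule, a document whose abstract matches the query closely outranks one that matches only in a minor subsection, which is the behavior the weighting is meant to encourage.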

Knowledge graph showing drug classifications and potential ACM/DCM treatments using predictive analysis.
Figure 3: Use case on knowledge synthesis and hypothesis exploration via tiered query cascade. This figure showcases a use case built on a chain of related questions and concepts that an investigator and/or healthcare professional might pose to the RUGGED system. Queries from the user are presented to the system in numerical order, with arrows representing inferred logical and domain-specific reasoning between questions. To respond to each query, the system retrieves implicit and relevant information (source shown in blue). Examples of system responses are presented in Figure 4.

Analyzing drug interactions for cardiomyopathy; network diagram visualizing beta-blocker effects.
Figure 4: Use case cardiovascular pathology: elucidating CVD pathogenesis. Query-response pairs between the user and the RUGGED system are shown. In the upper left panel, questions 1-6 extract information from the knowledge graph database to formulate evidence-rooted responses. Question 7 employs explainable graph link prediction to identify top-scoring therapeutics. The query prompts a prediction analysis, which is executed and processed automatically by the system, and key findings are succinctly summarized. Question 8 retrieves literature evidence from the defined text data corpus to verify, validate, and corroborate the predicted finding. System responses have been reviewed through a human-in-the-loop inspection process and modified for readability and brevity. A full transcript of these findings is provided in Supplementary File 1.

Steps	Description	Time
Accessing Biomedical Knowledge		30% total
Prepare biomedical literature corpus	Connect to PubMed and PubMed Central, download and parse publication data for downstream tasks.	25%
Prepare knowledge base data	Connect to biomedical knowledge bases, download and parse necessary information for downstream tasks.	5%
Information Extraction		30% total
CaseOLAP LIFT Text Mining Analysis	Identify high level disease-protein relationships within the biomedical text corpus.	25%
Knowledge Graph Construction	Connect and integrate disparate information from biomedical knowledge bases into a unified knowledge graph.	5%
Prediction Analysis		10% total
Train Graph Neural Network	Train the model on the biomedical knowledge graph data to learn hidden patterns within the graph.	5%
Relevance Ranking Analysis	Apply explainability module to highlight the most pertinent nodes and edges relevant to study disease.	2.5%
Link Prediction	Utilize explainability module to identify key nodes and edges contributing to new predicted edges.	2.5%
Hypothesis Generation and/or Validation		30% total
Database Setup for Retrieval Augmented Generation	Initialize the graph database for querying the knowledge graph and the vector database for text retrieval.	25%
Hypothesis Exploration	Enable user interaction with RUGGED to access and scrutinize relevant information for hypothesis exploration.	5%

Table 1: Workflow and rate-limiting steps. This table provides rough estimates of the computational time required for each stage of the workflow. Rate-limiting steps include accessing, extracting, and indexing biomedical knowledge necessary for retrieval-augmented generation. Hypothesis exploration may be repeated continuously without the need to re-execute rate-limiting steps.

Disease Category	MeSH Tree Numbers	# PMIDs	# Original Contributions	# Review Articles
Cardiomyopathies (CM)	C14.280.238 C14.280.434	132,531	102,337	19,942
Cardiac Arrhythmias (ARR)	C14.280.067 C23.550.073	125,286	92,374	13,854
Congenital Heart Defects (CHD)	C14.280.400	82,006	54,023	6,379
Heart Valve Diseases (VD)	C14.280.484	72,016	50,119	5,743
Myocardial Ischemia (IHD)	C14.280.647	256,986	210,042	30,223
Cardiac Conduction System Disease (CCD)	C14.280.123	53,050	35,399	4,363
Ventricular Outflow Obstruction (VOO)	C14.280.955	22,244	15,504	1,686
Other Heart Diseases (OTH)	C14.280.195 C14.280.282 C14.280.383 C14.280.470 C14.280.945 C14.280.459 C14.280.720	114,085	77,302	11,799
	Total	635,696	478,404	69,690

Table 2: Biomedical literature statistics. This table details the study disease categories with their corresponding MeSH tree numbers and the number of PubMed documents retrieved through May 2024, used as the corpus for text mining. A subset of these publications, consisting of original contribution research articles and review articles, is indexed into a vector database for retrieval by RUGGED during hypothesis generation.

Category	Number of Nodes	Number of Edges	Data Source(s)
Anatomy	5,049	122,533	Bgee, PubMed, MeSH, Uberon
Biological Process	27,047	108,106	Gene Ontology
Cellular Component	4,057	52,238	Gene Ontology
Compound	27,278	3,292,028	DrugBank, MeSH, CTD, UMLS, KEGG, TTD, SIDER, Inxight Drugs, Hetionet, PathFX, MyChem.info
Disease	21,938	311,773	PubMed, MeSH, DisGeNET, SIDER, ClinVar, ClinGen, PharmGKB, MyDisease.info, PathFX, UMLS, OMIM, Mondo, DOID, KEGG
Drug Class	5,721	8,283	ATC
Gene	29,810	943,419	HGNC, GRNdb, KEGG, ClinVar, ClinGen, SMPDB, DisGeNET, PharmGKB, MyGene.info
Molecular Function	11,151	47,086	Gene Ontology
Pathway	52,012	234,944	Reactome, KEGG, SMPDB
Protein	20,740	1,074,809	UniProt, Reactome, TTD, SMPDB, STRING, HGNC
Reaction	14,647	128,038	Reactome
Subtotal	219,450	6,323,257
Text-mining Associations	8	4,670
Total	219,458	6,327,927

Table 3: Knowledge graph statistics. This table details 11 broad biomedical categories comprising the constructed Know2BIO knowledge graph, enriched with additional edges derived from text mining analysis and predictive analysis. The resulting knowledge graph and predictions are managed by the Neo4j graph database for retrieval by RUGGED during hypothesis generation.

	Accuracy	Precision	Recall	F1-score	AUROC	AUPRC
Validation	0.7158	0.6639	0.8743	0.7547	0.8437	0.8637
Test	0.703	0.6367	0.9455	0.761	0.8961	0.9094

Table 4: Explainable AI model evaluation. This table reports the evaluation metrics for the knowledge graph link prediction using a two-layer graph convolutional neural network. Metrics were assessed by partitioning graph edges into 85% training, 5% validation, and 10% test datasets. Accuracy indicates the proportion of correctly classified predictions. Precision reports the proportion of correct positive predictions among all positive predictions. Recall measures the proportion of correct positive predictions among actual positive edges. The F1-score is the harmonic mean of precision and recall, balancing the two metrics. AUROC evaluates the model's ability to differentiate between positive and negative predictions. AUPRC quantifies the trade-off between precision and recall across different thresholds. With all metrics, higher values indicate better model performance.
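
The metric definitions in the Table 4 caption can be written out as a short sketch. The confusion-matrix counts below are toy values chosen for illustration, not results from the study, and the function names are hypothetical. The final lines recompute the F1-score from the precision and recall reported in Table 4 as a consistency check.

```python
from typing import Dict


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


# Toy counts, for illustration only.
toy = classification_metrics(tp=8, fp=4, tn=6, fn=2)
print({name: round(value, 4) for name, value in toy.items()})

# Consistency check against the precision and recall reported in Table 4.
validation_f1 = f1_from(0.6639, 0.8743)
test_f1 = f1_from(0.6367, 0.9455)
print(round(validation_f1, 4), round(test_f1, 4))
```

AUROC and AUPRC are threshold-free and require the full set of prediction scores, so they are not reproduced in this sketch.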

Supplementary File 1: This file details the full model response from RUGGED and a comparison against GPT-4o. Section A presents the complete human-computer interaction with RUGGED, expanding on the chain-of-query approach outlined in Figure 3 and providing the full response beyond the summary highlighted in Figure 4. Section B evaluates GPT-4o's responses without retrieval against RUGGED's, assessing attributes such as precision, depth, confidence scoring, evidence reliability, and cost.

Discussion

The RUGGED protocol leverages modern language models with up-to-date information to empower investigators to dynamically explore the evolving biomedical landscape and uncover new knowledge. This human-computer interaction drives an innovative process that pairs the efficiency of the machine (RUGGED) with the expertise and judgment of the investigator. The protocol is designed to be executed in the outlined sequence. Step 1 details the software installation. Step 2 and step 3 are essential for preparing biomedical literature and resources, while step 4 indexes this information for retrieval-augmented generation and user interaction with the LLM system. Time-intensive steps may run concurrently and/or sequentially. For example, creating the Neo4j graph (step 4.2.2) can begin during prediction analysis (step 3), and indexing can begin after constructing the knowledge graph (step 2.3) and text-mining (step 2.1). Steps that were started early on intermediate results must be repeated once the final results are available. While designed for biomedical information retrieval, this protocol, with minor modifications, may also handle other text and graph data, such as in-house data, clinical notes, or electronic health records. Data formatting details are in step 4.2.
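
The scheduling of concurrent steps described above can be sketched with a topological ordering. The prerequisite map below is an assumption inferred from the examples in the text (it is not a complete or authoritative description of the protocol), and the step labels are shorthand. The sketch uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later) to group steps into batches whose members could run at the same time.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Set

# Assumed prerequisites, inferred from the examples in the text.
# Each key lists the steps that must finish before it can start.
DEPENDENCIES: Dict[str, Set[str]] = {
    "2.1 text mining": set(),
    "2.3 knowledge graph": set(),
    "3 prediction analysis": {"2.3 knowledge graph"},
    "4.2.2 Neo4j graph": {"2.3 knowledge graph"},
    "indexing": {"2.1 text mining", "2.3 knowledge graph"},
}


def concurrent_batches(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Group steps into batches; steps within a batch have no unmet prerequisites."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    batches: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = concurrent_batches(DEPENDENCIES)
for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {', '.join(batch)}")
```

Under these assumed dependencies, the Neo4j graph creation and the indexing fall into the same batch as the prediction analysis, matching the example given in the text.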

The operation of this platform relies on the proper installation and interconnection of several technologies, including language models, graph databases, and vector databases (see Table of Materials). To verify these services are properly installed and connected, test scripts are provided in the `test` folder within the GitHub repository. External services may incur costs, with prices subject to change by the vendor. These optional services also have locally hosted alternatives, requiring only sufficient computational resources. However, these alternatives may impact model performance and/or convenience, making them unsuitable for some use-case scenarios.

With the rapidly evolving LLM landscape, new landmark models and task-specific models are released regularly. At the time of this report, the most appropriate models were chosen for the task. Users can choose which LLM to use by updating the configuration file accordingly (see steps 1.3.2-1.3.4). Model selection depends on each model's relevance to a particular use case. For example, incorporating models focused on ensuring that responses are fair, censored, and free of hate speech⁷³,⁷⁴,⁷⁵,⁷⁶,⁷⁷,⁷⁸ into this workflow is essential for ethical considerations. Furthermore, prompt engineering is essential to guide reliable and responsible behavior from the LLM⁷⁹,⁸⁰,⁸¹,⁸². The prompts crafted for the RUGGED workflow are tailored to the employed models and presented use cases. To fine-tune the prompts for a different use case, users can edit the `prompts.json` file in the `configuration` folder of the RUGGED workflow.

While RAG systems aim to reduce hallucinations in LLMs by grounding responses in evidence, these models may still produce inaccurate information or generally true but non-specific responses. A benchmark comparison of RUGGED against GPT-4o is provided in Supplementary File 1, Section B. Model hallucinations often occur when retrieved information exceeds the model's context window, analogous to dementia with memory loss and an inability to locate the data content, resulting in inaccurate responses⁸³,⁸⁴,⁸⁵. Choosing a suitable LLM helps to mitigate this issue. For instance, GPT-4o has a context limit of 128k tokens, significantly more than GPT-3.5 Turbo's 16k token limit, albeit at a higher cost to the user. Furthermore, LLMs fine-tuned with specific domain knowledge can potentially enhance the accuracy and specificity of responses in biomedical applications⁸⁶,⁸⁷,⁸⁸. Despite these measures, it is essential to cross-check the information before proceeding with costly wet lab experiments.
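
One way to guard against the context-window overflow described above is to cap the retrieved evidence at a token budget before it is passed to the LLM. The sketch below is a minimal illustration and not part of RUGGED: the function name `fit_to_context` is hypothetical, tokens are approximated by whitespace-separated words, and the chunks are assumed to arrive sorted from most to least relevant.

```python
from typing import List


def fit_to_context(chunks: List[str], max_tokens: int, reserved_tokens: int = 0) -> List[str]:
    """Keep the highest-ranked chunks that fit within the token budget.

    Assumptions: chunks are pre-sorted by relevance, and a token is
    approximated by a whitespace-separated word. reserved_tokens sets aside
    room for the prompt and the model's answer.
    """
    budget = max_tokens - reserved_tokens
    kept: List[str] = []
    used = 0
    for chunk in chunks:
        size = len(chunk.split())
        if used + size > budget:
            break  # stop at the first chunk that would overflow the budget
        kept.append(chunk)
        used += size
    return kept


# Toy example with a deliberately small budget.
ranked_chunks = [
    "alpha beta gamma delta epsilon",  # 5 tokens
    "zeta eta theta iota",             # 4 tokens
    "kappa lambda mu",                 # 3 tokens
]
selected = fit_to_context(ranked_chunks, max_tokens=10, reserved_tokens=2)
print(selected)
```

In practice, `max_tokens` would be set to the context limit of the chosen model (for example, the 128k limit cited above for GPT-4o), and token counts would come from that model's tokenizer.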

RUGGED leverages explainable AI within a RAG pipeline to scrutinize link predictions, identifying both reliable and previously undiscovered relationships. While traditional RAG systems rely on bulk similarity-based retrieval, this approach connects explainability with targeted response augmentation. Table 4 highlights the model's strong performance, demonstrating high recall (validation: 0.8743, test: 0.9455) and balanced F1-scores (validation: 0.7547, test: 0.761), indicating reliability in identifying true positives, albeit with a higher rate of false positives. The model's robustness is further supported by its AUROC (validation: 0.8437, test: 0.8961) and AUPRC (validation: 0.8637, test: 0.9094) values. Precision (validation: 0.6639, test: 0.6367), however, could benefit from threshold tuning, incorporating detailed node features, or improved handling of class imbalance. The model's effectiveness is highly dependent on the input knowledge graph; overfitting is a risk with smaller graphs, while larger graphs demand greater computational resources. More broadly, any RAG-based approach depends heavily on the quality of the data underlying the retrieval. For example, the construction of a knowledge graph is often time- and labor-intensive due to intrinsic noise in the original graph. This requires manual effort to denoise and label the data, as well as ongoing costs for maintaining and updating the databases.
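
The threshold tuning mentioned above can be illustrated on toy data. The scores and labels below are invented for illustration and are not outputs of the RUGGED link prediction model; the function name is hypothetical. The sketch shows the general trade-off: raising the decision threshold tends to increase precision at the expense of recall.

```python
from typing import Sequence, Tuple


def precision_recall_at(scores: Sequence[float], labels: Sequence[int],
                        threshold: float) -> Tuple[float, float]:
    """Precision and recall when edges with score >= threshold are predicted positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy predicted link scores and true labels (1 = real edge, 0 = no edge).
toy_scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]

low_threshold = precision_recall_at(toy_scores, toy_labels, threshold=0.5)
high_threshold = precision_recall_at(toy_scores, toy_labels, threshold=0.75)
print("threshold 0.50:", low_threshold)
print("threshold 0.75:", high_threshold)
```

A suitable threshold would normally be selected on the validation split and then applied unchanged to the test split.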

RUGGED's primary use is in knowledge synthesis and hypothesis exploration. By investigating various hidden relationships, such as disease mechanisms and drug treatments, RUGGED efficiently conducts literature triage. To reduce computational burden, most applications can be hosted on a server (e.g., AWS or computational server) and configured to update periodically with the latest information. Furthermore, this workflow can be adapted to accomplish domain-specific applications, such as serving as a platform to include patient data with local models to uphold security, privacy, and confidentiality. Beyond biomedical research, RUGGED's modular design allows it to support tasks across information retrieval, inference, and summarization by customizing the RAG pipeline and prompt engineering strategies tailored to the target domain. Successful adaptation necessitates careful consideration of domain-specific challenges, such as pre-processing of diverse data formats and evaluating the appropriate models for task- and domain-specific needs.

Disclosures

The authors have nothing to disclose.

Acknowledgements

The authors would like to thank Dr. Alex Bui for his guidance and thoughtful discussion. In addition, we thank Dr. Ding Wang for his helpful discussions. This work was supported in part by NIH 1U54HG012517-01 to P.P., K.W., and W.W.; NIH T32 HL13945 to A.R.P.; National Science Foundation Research Traineeship (NRT) 1829071 to A.R.P.; and the TC Laubisch Endowment to P.P. at UCLA.

Materials

List of materials used in this article
Name	Company	Catalog Number	Comments
Hardware/Software - Graphics Card and software driver	Nvidia	https://www.nvidia.com	A graphics card and its associated driver software are highly recommended to significantly reduce runtime for computationally intensive tasks, such as local LLM and predictive analyses. For devices equipped with an NVIDIA RTX GPU, download and install the necessary drivers and CUDA Toolkit from the NVIDIA website (https://developer.nvidia.com/cuda-downloads).
Software - Commercial Large Language Model Service	OpenAI	https://openai.com	RUGGED supports the OpenAI API for models such as GPT-3.5 and GPT-4o. To set up OpenAI models, first obtain an OpenAI API key: proceed to OpenAI's website (https://openai.com/blog/openai-api) to create an account, load funds, and obtain the key. This API key is required to enable RUGGED to use OpenAI models. Consult OpenAI's model documentation (https://platform.openai.com/docs/models) to determine which OpenAI models the LLM agents within RUGGED's system will use. NOTE: OpenAI API is a paid service. At time of publication, cost for GPT-4o is $5.00 per 1 million input tokens and $2.50 per 1 million output tokens (For more, visit https://openai.com/pricing).
Software - Containerization	Docker	https://www.docker.com	Docker aids in maintaining a consistent computational runtime environment, streamlining software installation and execution across different machines. To install Docker, visit the Docker website (https://www.docker.com/), click on 'Get started', download and install the appropriate version for the OS. Verify installation by typing `docker --version` in the terminal; successful installation reports the Docker version installed.
Software - Graph Database	Neo4j	https://neo4j.com	Neo4j is a graph database software that efficiently manages and queries graph-based nodes and relationships. RUGGED supports Neo4j in multiple forms: Docker container, Neo4j Desktop, or Neo4j AuraDB online server. Choose the option best suited to the use case. Setting up Neo4j as a Docker container: Run these commands to set up Neo4j in Docker, replacing 'PATH_TO_FOLDER' with the file path of the folder (e.g., /Users/username/RUGGED). For more details on troubleshooting, refer to the Neo4j Docker website (https://hub.docker.com/_/neo4j). `docker pull neo4j` `docker run --name neo4j --net rugged_network --publish=7474:7474 --publish=7687:7687 -d -v 'PATH_TO_FOLDER'/neo4j/data:/data neo4j` NOTE: Initialize Neo4j in Docker for the first time by setting a username and password, either by running the neo4j_setup.py script (e.g., `python neo4j_setup.py`) or via the web interface at http://localhost:7474. Setting up Neo4j Desktop: If using Neo4j Desktop, download and install it from the Neo4j website (https://neo4j.com/). Create a new project by clicking "New", then click "Add" to create a new Database Management System (DBMS). Select "Local DBMS", set a password, click "Create", then click "Start". A green "ACTIVE" text indicates it is running. Setting up Neo4j AuraDB: Visit the Neo4j website (https://neo4j.com/cloud/aura-free/) to create an account and log in. Select "New Instance" to create an empty instance and save the URI and initial password to access the bolt interface (e.g., bolt://myurl.neo4j.com). Click the play button to start the instance, which will display the connection URI in the information box. NOTE: Neo4j AuraDB offers a free tier up to 200,000 nodes and 400,000 relationships. For larger graphs, visit Neo4j pricing (https://neo4j.com/pricing).
Software - Local Large Language Model Service	Ollama	https://ollama.com	RUGGED supports the use of local models using Ollama (e.g., Llama3). To enable, first install Ollama on the device or download the Docker container. To install Ollama, visit the Ollama website (https://ollama.com/download) and follow installation instructions. To install Ollama on Docker, run the following command: `docker pull ollama/ollama` NOTE: At time of publication, there is no stable release for Ollama on Windows OS.
Software - Version control	Git	https://www.git-scm.com	Version control software enables efficient installation and updating of software. To install Git, visit the Git website (https://www.git-scm.com/), click on 'Downloads', download and install the appropriate version for the OS. Verify installation by typing `git --version` in the terminal; successful installation will report the version of Git installed.
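
The installation checks listed in the table above (`docker --version` and `git --version`) can also be run from a short script. This is a minimal sketch using only the Python standard library; it is not one of the test scripts shipped in the RUGGED repository, and the function names are hypothetical.

```python
import shutil
import subprocess
from typing import Dict, Optional, Sequence


def tool_version(command: str) -> Optional[str]:
    """Return the first line printed by `<command> --version`, or None if unavailable."""
    if shutil.which(command) is None:
        return None  # the executable is not on the PATH
    try:
        result = subprocess.run(
            [command, "--version"],
            capture_output=True,
            text=True,
            timeout=10,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


def check_tools(commands: Sequence[str] = ("docker", "git")) -> Dict[str, Optional[str]]:
    """Map each command name to its reported version string (or None)."""
    return {command: tool_version(command) for command in commands}


if __name__ == "__main__":
    for name, version in check_tools().items():
        print(f"{name}: {version or 'not found'}")
```

A reported version string indicates only that the tool is installed; it does not confirm that services such as Neo4j or Ollama are running or reachable.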

References

Wooller, S. K., Benstead-Hume, G., Chen, X., Ali, Y., Pearl, F. M. G. Bioinformatics in translational drug discovery. Biosci Rep. 37 (4), BSR20160180 (2017).
Sadybekov, A. V., Katritch, V. Computational approaches streamlining drug discovery. Nature. 616 (7958), 673-685 (2023).
Himmelstein, D. S., et al. Systematic integration of biomedical knowledge prioritizes drugs for repurposing. eLife. 6, e26726 (2017).
Gutiérrez-Sacristán, A., et al. Text mining and expert curation to develop a database on psychiatric diseases and their genes. Database (Oxford). 2017, bax043 (2017).
Pelletier, A. R., et al. A knowledge graph approach to elucidate the role of organellar pathways in disease via biomedical reports. J Vis Exp. (200), e65084 (2023).
Santos, A., et al. A knowledge graph to interpret clinical proteomics data. Nat Biotechnol. 40 (5), 692-702 (2022).
Zheng, S., et al. PharmKG: A dedicated knowledge graph benchmark for bomedical data mining. Briefings in Bioinformatics. 22 (4), bbaa344 (2021).
Soman, K., et al. Biomedical knowledge graph-optimized prompt generation for large language models. Bioinformatics. 40 (9), btae560 (2024).
Xiao, Y., et al. Know2BIO: A comprehensive dual-view benchmark for evolving biomedical knowledge graphs. ArXiv. (2023).
Thirunavukarasu, A. J., et al. Large language models in medicine. Nat Med. 29 (8), 1930-1940 (2023).
Lehman, E., et al. Do we still need clinical language models? ArXiv. (2023).
Singhal, K., et al. Large language models encode clinical knowledge. Nature. 620, 172-180 (2023).
Agrawal, M., Hegselmann, S., Lang, H., Kim, Y., Sontag, D. Large language models are few-shot clinical information extractors. ArXiv. (2022).
Johnson, D., et al. Assessing the accuracy and reliability of AI-generated medical responses: An evaluation of the Chat-GPT model. Res Sq. (2023).
Jahan, I., Laskar, M. T. R., Peng, C., Huang, J. Evaluation of ChatGPT on biomedical tasks: A zero-shot comparison with fine-tuned generative transformers. The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks. 326-336 (2023).
Samaan, J. S., et al. Assessing the accuracy of responses by the language model ChatGPT to questions regarding bariatric surgery. Obes Surg. 33 (6), 1790-1796 (2023).
Thirunavukarasu, A. J., et al. Trialling a large language model (ChatGPT) in general practice with the applied knowledge test: observational study demonstrating opportunities and limitations in primary care. JMIR Med Educ. 9, e46599 (2023).
Sun, W., et al. Is ChatGPT Good at search? Investigating large language models as re-ranking agents. ArXiv. (2023).
Xu, R., Feng, Y., Chen, H. ChatGPT vs. Google: A comparative study of search performance and user experience. ArXiv. (2023).
Lin, S., Hilton, J., Evans, O. TruthfulQA: Measuring how models mimic human falsehoods. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 3214-3252 (2022).
Manakul, P., Liusie, A., Gales, M. J. F. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. ArXiv. (2023).
Min, S., et al. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 12076-12100 (2023).
Zhang, J., et al. Is ChatGPT fair for recommendation? Evaluating fairness in large language model recommendation. Proceedings of the 17th ACM Conference on Recommender Systems. 993-999 (2023).
Sankar, B. S., et al. Building an ethical and trustworthy biomedical AI ecosystem for the translational and clinical integration of foundation models. Bioengineering. 11 (10), 984 (2024).
Shen, Y., et al. ChatGPT and Other large language models are double-edged swords. Radiology. 307 (2), e230163 (2023).
Li, H., et al. Ethics of large language models in medicine and medical research. Lancet Digit Health. 5 (6), e333-e335 (2023).
Lewis, P., et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. ArXiv. (2020).
Gao, Y., et al. Retrieval-augmented generation for large language models: A survey. ArXiv. (2023).
Wei, C. -H., Allot, A., Leaman, R., Lu, Z. PubTator central: Automated concept annotation for biomedical full text articles. Nucleic Acids Res. 47 (W1), W587-W593 (2019).
Wei, C. -H., et al. PubTator 3.0: An AI-powered literature resource for unlocking biomedical knowledge. ArXiv. (2024).
Liu, L., Ji, H., Xu, J., Tong, H. Comparative Reasoning for knowledge graph fact checking. 2022 IEEE International Conference on Big Data (Big Data). 2309-2312 (2022).
Liu, L., Tong, H. Knowledge Graph reasoning and its applications. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 5813-5814 (2023).
Liu, L., et al. Logic query of thoughts: Guiding large language models to answer complex logic queries with knowledge graphs. ArXiv. (2024).
Logan, R., Liu, N. F., Peters, M. E., Gardner, M., Singh, S. Barack's wife Hillary: Using Knowledge graphs for fact-aware language modeling. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 5962-5971 (2019).
Sun, J., et al. Think-on-graph: Deep and responsible reasoning of large language model on knowledge graph. ArXiv. (2024).
Wen, Y., Wang, Z., Sun, J. MindMap: Knowledge Graph prompting sparks graph of thoughts in large language models. ArXiv. (2024).
Wang, C., Liu, X., Song, D. Language models are open knowledge graphs. ArXiv. (2020).
Yasunaga, M., Ren, H., Bosselut, A., Liang, P., Leskovec, J. QA-GNN: Reasoning with Language Models and Knowledge Graphs for Question Answering. Proceedings of the 2021 Conference of the North American Chapter of the, 535-546 (2021).
Wang, L., Zhao, W., Wei, Z., Liu, J. SimKGC: Simple contrastive knowledge graph completion with pre-trained language models. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 4281-4294 (2022).
Lazar, A. Graph neural networks for link prediction. FLAIRS. 36 (2023).
Zhang, M., Chen, Y. Link prediction based on graph neural networks. ArXiv. (2018).
Yuan, H., Tang, J., Hu, X., Ji, S. XGNN: Towards model-level explanations of graph neural networks. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. (2020).
Herath, J. D., Wakodikar, P., Yang, P., Yan, G. CFGExplainer: Explaining graph neural network-based malware classification from control flow graphs. 2022 52nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). 172-184 (2022).
Ying, R., Bourgeois, D., You, J., Zitnik, M., Leskovec, J. GNNExplainer: Generating explanations for graph neural networks. Adv Neural Inf Process Syst. 32, 9240-9251 (2019).
Bastian, F. B., et al. The Bgee suite: Integrated curated expression atlas and comparative transcriptomics in animals. Nucleic Acids Res. 49 (D1), D831-D847 (2021).
Davis, A. P., et al. Comparative Toxicogenomics Database (CTD): Update 2023. Nucleic Acids Res. 51 (D1), D1257-D1262 (2023).
Rehm, H. L., et al. ClinGen - The clinical genome resource. N Engl J Med. 372 (23), 2235-2242 (2015).
Landrum, M. J., et al. ClinVar: Improvements to accessing data. Nucleic Acids Res. 48 (D1), D835-D844 (2020).
Schriml, L. M., et al. The human disease ontology 2022 update. Nucleic Acids Res. 50 (D1), D1255-D1261 (2022).
Piñero, J., Saüch, J., Sanz, F., Furlong, L. I. The DisGeNET cytoscape app: Exploring and visualizing disease genomics data. Comput Struct Biotechnol J. 19, 2960-2967 (2021).
Knox, C., et al. DrugBank 6.0: The DrugBank knowledgebase for 2024. Nucleic Acids Res. 52 (D1), D1265-D1275 (2024).
Fang, L., et al. GRNdb: Decoding the gene regulatory networks in diverse human and mouse conditions. Nucleic Acids Res. 49 (D1), D97-D103 (2021).
Gene Ontology Consortium. The Gene Ontology resource: enriching a GOld mine. Nucleic Acids Res. 49 (D1), D325-D334 (2021).
Seal, R. L., et al. Genenames.org: The HGNC resources in 2023. Nucleic Acids Res. 51 (D1), D1003-D1009 (2023).
Siramshetty, V. B., et al. NCATS Inxight Drugs: A comprehensive and curated portal for translational research. Nucleic Acids Res. 50 (D1), D1307-D1316 (2022).
Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., Morishima, K. KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 45 (D1), D353-D361 (2017).
Lipscomb, C. E. Medical Subject Headings (MeSH). Bull Med Libr Assoc. 88 (3), 265-266 (2000).
Vasilevsky, N. A., et al. Mondo: Unifying diseases for the world, by the world. medRxiv. (2022).
Lelong, S., et al. BioThings SDK: A toolkit for building high-performance data APIs in biomedical research. Bioinformatics. 38 (7), 2077-2079 (2022).
Amberger, J. S., Bocchini, C. A., Scott, A. F., Hamosh, A. OMIM.org: Leveraging knowledge across phenotype-gene relationships. Nucleic Acids Res. 47 (D1), D1038-D1043 (2019).
Wilson, J. L., et al. PathFX provides mechanistic insights into drug efficacy and safety for regulatory review and therapeutic development. PLoS Comput Biol. 14 (12), e1006614 (2018).
Gong, L., Whirl-Carrillo, M., Klein, T. E. PharmGKB, an Integrated resource of pharmacogenomic knowledge. Curr Protoc. 1 (8), e226 (2021).
Gillespie, M., et al. The reactome pathway knowledgebase 2022. Nucleic Acids Res. 50 (D1), D687-D692 (2022).
Kuhn, M., Letunic, I., Jensen, L. J., Bork, P. The SIDER database of drugs and side effects. Nucleic Acids Res. 44 (D1), D1075-D1079 (2016).
Jewison, T., et al. SMPDB 2.0: Big improvements to the small molecule pathway database. Nucleic Acids Res. 42 (Database issue), D478-D484 (2014).
Szklarczyk, D., et al. STRING v11: Protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 47 (D1), D607-D613 (2019).
Zhou, Y., et al. Therapeutic target database update 2022: Facilitating drug discovery with enriched comparative data of targeted agents. Nucleic Acids Res. 50 (D1), D1398-D1407 (2022).
Bodenreider, O. The Unified Medical Language System (UMLS): Integrating biomedical terminology. Nucleic Acids Res. 32 (Database issue), D267-D270 (2004).
Haendel, M. A., et al. Unification of multi-species vertebrate anatomy ontologies for comparative biology in Uberon. J Biomed Semantics. 5, 21 (2014).
UniProt Consortium. UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Res. 51 (D1), D523-D531 (2023).
Lewis, M., et al. Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 7871-7880 (2020).
Sigdel, D., et al. Cloud-based phrase mining and analysis of user-defined phrase-category association in biomedical publications. J Vis Exp. (144), e59108 (2019).
Ferrara, E. Should ChatGPT be biased? Challenges and risks of bias in large language models. FM. ArXiv. (2023).
Gallegos, I. O., et al. Bias and fairness in large language models: A survey. ArXiv. (2023).
Hosseini, M., Horbach, S. P. J. M. Fighting reviewer fatigue or amplifying bias? Considerations and recommendations for use of ChatGPT and other large language models in scholarly peer review. Res Integr Peer Rev. 8 (1), 4 (2023).
Kotek, H., Dockum, R., Sun, D. Gender bias and stereotypes in Large Language Models. Proceedings of The ACM Collective Intelligence Conference. 12-24 (2023).
Kamruzzaman, M., Kim, G. L. Prompting techniques for reducing social bias in LLMs through System 1 and System 2 Cognitive Processes. ArXiv. (2024).
Raza, S., Raval, A., Chatrath, V. MBIAS: Mitigating bias in large language models while retaining context. ArXiv. (2024).
Chen, B., Zhang, Z., Langrené, N., Zhu, S. Unleashing the potential of prompt engineering in Large Language Models: A comprehensive review. ArXiv. (2023).
White, J., et al. A prompt pattern catalog to enhance prompt engineering with ChatGPT. ArXiv. (2023).
Meskó, B. Prompt engineering as an important emerging skill for medical professionals: Tutorial. J Med Internet Res. 25, e50638 (2023).
Wang, J., et al. Prompt Engineering for Healthcare: Methodologies and applications. ArXiv. (2023).
Luo, Y., et al. An empirical study of catastrophic forgetting in large language models during continual fine-tuning. ArXiv. (2023).
Xu, P., et al. Retrieval meets Long Context Large Language Models. ArXiv. (2023).
Chen, S., Wong, S., Chen, L., Tian, Y. Extending context window of Large Language Models via positional interpolation. ArXiv. (2023).
Labrak, Y., et al. BioMistral: A collection of open-source pretrained large language models for medical domains. ArXiv. (2024).
Luo, R., et al. BioGPT: Generative pre-trained transformer for biomedical text generation and mining. Brief Bioinform. 23 (6), bbac409 (2022).
Wang, C., et al. A survey for large language models in biomedicine. ArXiv. (2024).
