June 13th, 2025
This article describes RUGGED (Retrieval Under Graph-Guided Explainable disease Distinction), which integrates Large Language Model (LLM) inference with Retrieval-Augmented Generation (RAG). It draws evidence from expert-curated biomedical knowledge bases and peer-reviewed biomedical publications to synthesize new knowledge from up-to-date information, identify explainable and actionable predictions, and pinpoint promising directions for hypothesis-driven investigations.
This protocol presents a platform for reliably exploring biomedical and clinical questions and for generating hypotheses. RUGGED helps explore the biomedical landscape by connecting large language models to peer-reviewed publications and curated biomedical knowledge bases, and by using explainable AI to uncover new relationships. Recent advances in generative AI and large language models have transformed how we engage with evidence-supported biomedical resources, enabling tasks such as summarization, question answering, and flexible hypothesis exploration. Earlier approaches relied on text mining to extract patterns and high-level relationships from biomedical literature. Today, approaches combine large language models with retrieval-augmented generation, agentic systems, and tool-calling capabilities. Many publicly available language models struggle with reliability and can produce factually incorrect information. While recent models have improved, their output at the time of publication often lacked domain specificity, relied on vague, general language, and contained lengthy, fragmented explanations. In previous publications with JoVE, we highlighted how text mining and biomedical knowledge graph modeling are applied to predict and understand relationships between proteins, cellular components, and cardiovascular disease. Building on this foundation, our latest research focuses on integrating this structured biomedical knowledge with workflows supported by large language models, enabling accurate inference and evidence-based responses.
[Narrator] To begin, start the RUGGED service with the command in the terminal. Extract biomedical literature and identify relevant documents, along with high-level protein-disease relationships, using caseOLAP LIFT. Visit the caseOLAP LIFT JoVE protocol and perform the caseOLAP LIFT text mining analysis. Next, clone the Know2BIO repository in the terminal. Using the command line, execute the create_edge_files.py script to download the knowledge base resources, and monitor the progress of the extraction pipeline. Then, construct the knowledge graph with the prepare_kgs.py script. Integrate the results with the combine_kg_results.py script, which merges the relationships and entities extracted from the text mining analysis and the knowledge graph construction into one comprehensive graph.
Identify biomedical entities of interest by reviewing the knowledge graph and selecting relevant nodes for use in predictive analysis. Run the filter.py script to extract the subgraph reachable within two hops of the selected disease nodes of interest. Run the prediction analysis script, specifying the edges to predict and the input knowledge graph as command-line arguments, and obtain the output.
Now, connect to the RUGGED Docker container. If the previous terminal window was closed, reconnect to the Docker container. Once connected, use cd in the command line to navigate to the RUGGED directory in the workspace, and perform all remaining steps within this command line window. After verifying that all supporting services are running, start RUGGED in the command line interface to begin interacting with the system. To query the knowledge graph, pose a question in natural language starting with the keyword "query". For example, type "query what are the currently prescribed drugs classified as beta blockers?" Explore the predictions from the link prediction analysis with questions beginning with the keyword "predict".
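The two-hop subgraph extraction described above can be illustrated with a minimal Python sketch. This is not the filter.py script itself, whose internals are not described here. It assumes the knowledge graph is available as (head, relation, tail) triples, treats edges as undirected when measuring hops, and uses hypothetical node identifiers such as "disease:A" purely for illustration.

```python
from collections import deque


def k_hop_subgraph(edges, seeds, max_hops=2):
    """Return (nodes, edges) of the subgraph within max_hops of any seed node.

    edges: list of (head, relation, tail) triples (assumed format).
    seeds: iterable of node identifiers, e.g. the selected disease nodes.
    """
    # Build an undirected adjacency map from the triples.
    adjacency = {}
    for head, _relation, tail in edges:
        adjacency.setdefault(head, set()).add(tail)
        adjacency.setdefault(tail, set()).add(head)

    # Breadth-first search that records each node's hop distance from a seed.
    distance = {seed: 0 for seed in seeds}
    queue = deque(distance)
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)

    # Keep every edge whose two endpoints were both reached (induced subgraph).
    kept_nodes = set(distance)
    kept_edges = [e for e in edges if e[0] in kept_nodes and e[2] in kept_nodes]
    return kept_nodes, kept_edges


# Hypothetical toy graph: a chain of four edges starting at one disease node.
toy_edges = [
    ("disease:A", "associated_with", "protein:P1"),
    ("protein:P1", "targeted_by", "drug:D1"),
    ("drug:D1", "interacts_with", "drug:D2"),
    ("drug:D2", "treats", "disease:B"),
]
nodes, kept = k_hop_subgraph(toy_edges, ["disease:A"], max_hops=2)
print(sorted(nodes))
print(len(kept))
```

In this sketch, the hop limit bounds how far the search expands from each seed, which keeps the graph passed to the prediction step small. Whether edges should be followed in both directions depends on how the real knowledge graph encodes relationships.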
Then, using the keyword "search", retrieve documents from step two related to a biomedical topic in natural language. Refine the inquiries iteratively using RUGGED's chat-like interface in the same terminal window. Optionally, modify and rerun Cypher commands in Neo4j to refine the knowledge graph query results. Summarize the entire interaction with the keyword "summarize" to output a text summary for later review, and conduct a human-in-the-loop review to enhance the readability and accuracy of the system responses before finalizing the summary. Finally, review the chat logs in the log folder within RUGGED and inspect the full text of the interaction.
The knowledge graph constructed using Know2BIO included 219,450 nodes and 6,323,257 edges. The RUGGED system embedded the knowledge graph and publication data using the BART model for vector search; publications longer than 500 tokens were summarized section-wise.
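The interactive session is driven by four leading keywords: "query", "predict", "search", and "summarize". As a rough illustration of how such prefix-based routing can work, the Python sketch below splits a line of input into a keyword and the remaining natural-language text. It is not RUGGED's actual parser. The function name route and the "chat" fallback for input without a keyword are assumptions made for this example.

```python
# Keywords named in the protocol; anything else is treated as free-form chat
# (the fallback behavior is an assumption, not documented RUGGED behavior).
KEYWORDS = ("query", "predict", "search", "summarize")


def route(user_input):
    """Split one line of user input into (keyword, remaining_text)."""
    text = user_input.strip()
    first_word, _, rest = text.partition(" ")
    keyword = first_word.lower().rstrip(":")
    if keyword in KEYWORDS:
        return keyword, rest.strip()
    return "chat", text


example = "query what are the currently prescribed drugs classified as beta blockers?"
print(route(example))
print(route("summarize"))
```

A real system would then pass the remaining text to the matching component, for instance knowledge graph querying for "query" or literature retrieval for "search".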
This article presents RUGGED (Retrieval Under Graph-Guided Explainable disease Distinction), a platform that integrates Large Language Model inference with Retrieval-Augmented Generation. It aims to synthesize new knowledge from biomedical literature and knowledge bases, facilitating hypothesis generation and exploration of biomedical questions.