Multimodal Knowledge Graphs Based on Rule-Based Linguistic Analysis and Computer Vision

Preeti Vats; Nonita Sharma; Deepak Kumar Sharma; Alongbar Wary

doi:10.3791/69803

Method Article


April 3rd, 2026

Preeti Vats¹ , Nonita Sharma¹ , Deepak Kumar Sharma¹ , Alongbar Wary¹

¹Indira Gandhi Delhi Technical University for Women

Summary

VISHAM-KG is a multimodal framework that constructs knowledge graphs from Hindi visual documents by aligning textual and visual entities. It combines rule-based linguistic analysis with computer vision techniques to produce subject-relation-object triplets in low-resource Indic language settings.

Abstract

Visual-Semantic Hindi-Aligned Multimodal Knowledge Graph (VISHAM-KG) is a framework for constructing consistent multimodal knowledge graphs (KGs) from Hindi visual documents by systematically aligning visual and textual entities. This study aims to integrate rule-based linguistic analysis with computer vision-based object detection to support structured semantic representation and grounded reasoning in low-resource Indic languages. The proposed algorithm begins with the preparation of Hindi visual documents for natural language processing (NLP), followed by optical character recognition (OCR) to extract Devanagari script and by linguistic preprocessing, which includes tokenization, lemmatization, part-of-speech tagging, and dependency parsing. In parallel, visual entities are extracted from images using object detection and filtered using confidence thresholds. Textual and visual entities are embedded into a shared semantic space using the multilingual transformer model XLM-R together with CLIP-ViT, and are aligned using cosine similarity thresholds. These aligned entities are combined with rule-based dependency relations to generate multimodal triplets. The protocol produces a structured multimodal knowledge graph encoded as subject-relation-object triplets with explicit visual grounding based on the Indian knowledge base. The resulting output supports cross-modal querying, entity alignment, and knowledge graph reasoning for Hindi visual documents and provides a replicable framework for multimodal knowledge construction in low-resource linguistic settings.
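
The cosine similarity-based alignment step can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy three-dimensional vectors, and the threshold of 0.5 are all hypothetical, and the sketch assumes that the text embeddings (XLM-R) and visual embeddings (CLIP-ViT) have already been projected into a shared space of equal dimension.

```python
import numpy as np


def cosine_similarity_matrix(text_emb, vis_emb):
    """Pairwise cosine similarities, shape (n_text, n_visual)."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = vis_emb / np.linalg.norm(vis_emb, axis=1, keepdims=True)
    return t @ v.T


def align_entities(text_labels, text_emb, vis_labels, vis_emb, threshold=0.5):
    """Pair each textual entity with its most similar visual entity.

    A pair is kept only if its cosine similarity reaches the threshold.
    The default threshold is an illustrative assumption.
    """
    sims = cosine_similarity_matrix(
        np.asarray(text_emb, dtype=float), np.asarray(vis_emb, dtype=float)
    )
    pairs = []
    for i, label in enumerate(text_labels):
        j = int(np.argmax(sims[i]))      # best visual match for this text entity
        score = float(sims[i, j])
        if score >= threshold:           # discard weak matches
            pairs.append((label, vis_labels[j], score))
    return pairs


# Toy embeddings (hypothetical), standing in for projected XLM-R / CLIP-ViT vectors.
text_labels = ["बिल्ली", "पेड़"]          # "cat", "tree"
text_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
vis_labels = ["cat", "car"]
vis_emb = [[0.9, 0.1, 0.0], [0.0, 0.1, 1.0]]

pairs = align_entities(text_labels, text_emb, vis_labels, vis_emb)
for text_label, vis_label, score in pairs:
    print(f"{text_label} <-> {vis_label} (cosine = {score:.3f})")
```

In this toy input only the first textual entity finds a visual counterpart above the threshold; the second has no sufficiently similar detection and is left unaligned, so it would remain a text-only node.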

Introduction

Knowledge graphs (KGs) are structured semantic graph representations in which entities are modelled as nodes and relationships as edges. They enable efficient knowledge retrieval and contextual reasoning across applications such as question answering, recommendation systems, and information extraction¹. Over the past decade, KG construction methodologies have advanced substantially. However, most existing approaches are designed for resource-rich languages and rely predominantly on large-scale textual corpora². As a result, low-resource languages remain underrepresented, constraining the applicability....

Protocol

No ethical approval is required for this protocol, as it exclusively uses publicly available, non-human, non-sensitive visual and textual data. Table 2 lists all tools and techniques along with their dependencies. All source code, configuration files, and scripts required to reproduce the multimodal knowledge graph construction pipeline are available in a public GitHub repository (preeti017phdit22-wq/VISHAM_KG). The repository includes installation instructions and dependency specifications to facilitate reproducibility.

Mod....

Results

The proposed VISHAM-KG framework is evaluated through similarity score computation and link prediction, two tasks commonly used with knowledge representation benchmark datasets.

Experimental setup

Evaluate the constructed multimodal knowledge graph using two established tasks: (i) cross-modal similarity assessment and (ii) knowledge graph link prediction. Perform all evaluations exclusively on the finalized graph output generated at the endpoint of the protocol. .......
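
As an illustration of the link prediction task, the sketch below computes mean reciprocal rank (MRR) and Hits@k, two metrics widely used for this kind of evaluation. Whether the protocol reports exactly these metrics is an assumption, and the function names, candidate scores, and ranks are hypothetical.

```python
def rank_of_true(scores, true_index):
    """1-based rank of the true candidate, where a higher score is better."""
    true_score = scores[true_index]
    return 1 + sum(1 for s in scores if s > true_score)


def mrr_and_hits(ranks, k=10):
    """Return (mean reciprocal rank, Hits@k) for a list of 1-based ranks."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n
    hits = sum(1 for r in ranks if r <= k) / n
    return mrr, hits


# Hypothetical model scores for three candidate tail entities; the true
# tail is at index 2 and is outscored by one other candidate.
print(rank_of_true([0.1, 0.9, 0.5], true_index=2))

# Hypothetical ranks of the true entity for three test triplets.
ranks = [1, 2, 4]
mrr, hits_at_3 = mrr_and_hits(ranks, k=3)
print(f"MRR = {mrr:.3f}, Hits@3 = {hits_at_3:.3f}")
```

Ranks are computed per test triplet by scoring every candidate entity and counting how many outscore the true one; the two aggregate values then summarize how high the correct entity tends to appear.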

Discussion

The performance of the VISHAM-KG framework depends primarily on three critical components: OCR for Devanagari text (step 1.2), confidence-based visual object detection using CLIP-ViT (step 1.3), and embedding-based cross-modal alignment (step 1.4). OCR accuracy directly influences downstream linguistic parsing and entity extraction: errors introduced at this stage propagate to relation identification and reduce alignment precision. This effect is mitigated through Hindi-specific normalization, lemmatization, and .......
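
The confidence-based filtering of visual detections can be sketched as follows. This is a minimal illustration and not the authors' code: the `Detection` container, the `filter_detections` helper, the example detections, and the 0.5 cut-off are hypothetical, and the detector itself is assumed to have already produced labels with confidence scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected object (hypothetical container for detector output)."""
    label: str
    confidence: float
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels


def filter_detections(detections, min_confidence=0.5):
    """Keep detections at or above the threshold, most confident first.

    The default threshold is an illustrative assumption.
    """
    kept = [d for d in detections if d.confidence >= min_confidence]
    return sorted(kept, key=lambda d: d.confidence, reverse=True)


# Hypothetical detector output for one story illustration.
detections = [
    Detection("ball", 0.62, (40, 200, 90, 250)),
    Detection("kite", 0.30, (300, 10, 360, 70)),
    Detection("dog", 0.91, (120, 80, 260, 240)),
]

visual_entities = filter_detections(detections)
print([d.label for d in visual_entities])
```

Raising the threshold trades recall for precision: fewer spurious visual entities reach the alignment stage, but genuinely present objects with low detector confidence are also lost.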

Disclosures

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Materials

List of materials used in this article
Name	Version / Specification	Company / Source	Comments
BiLSTM-CRF and Indic NER Model	Custom-trained	PyTorch	Named entity recognition
CLIP-ViT-B/32	2022-09	OpenAI	Visual embedding generation
CPU	Intel i9	Intel	General computation
EasyOCR	v1.7.1	Jaided AI	Hindi text extraction from images
GPU	NVIDIA RTX 3090	NVIDIA	Model inference acceleration
Hindi Kids Stories	10 stories	Curated dataset	Evaluation corpus
Neo4j	v5.13	Neo4j Inc.	Knowledge graph storage
NumPy	v1.24	NumPy Community	Numerical computations
Pandas	v2.0	Pandas Community	Data handling
Python	v3.10	Python Software Foundation	Pipeline implementation
PyTorch	v2.0	Meta AI	Deep learning framework
Stanza (Hindi Model)	v1.6.1	Stanford NLP	POS tagging and dependency parsing
XLM-R (Base)	2023-05	HuggingFace	Text embedding generation
YOLOv8	v8.0.208	Ultralytics	Visual object detection

References

Alberts, A., et al. VisualSem: A high-quality knowledge graph for vision and language. arXiv. (2020).
Chen, Y., et al. A survey on multimodal knowledge graphs: Construction, completion and applications. Mathematics. 11 (8), 1815-1835 (2023....
