Method Article

Multimodal Knowledge Graphs Based on Rule-Based Linguistic Analysis and Computer Vision

DOI: 10.3791/69803

April 3rd, 2026

Summary

VISHAM-KG is a multimodal framework that constructs knowledge graphs from Hindi visual documents by aligning textual and visual entities. It combines rule-based linguistic analysis with computer vision techniques to produce subject-relation-object triplets in low-resource Indic language settings.
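The rule-based step that turns dependency relations into subject-relation-object triplets can be illustrated with a minimal sketch. It is not the authors' implementation: the `Token` type, the `extract_triplets` helper, and the hand-written parse of the example sentence are all hypothetical, and the only rule shown is the simplest one (a head token that governs both an `nsubj` and an `obj` dependent, using Universal Dependencies labels).

```python
from typing import NamedTuple


class Token(NamedTuple):
    idx: int     # 1-based position in the sentence
    text: str    # surface form
    lemma: str   # dictionary form
    head: int    # index of the governing token (0 means root)
    deprel: str  # Universal Dependencies relation label


def extract_triplets(tokens):
    """Return (subject, relation, object) for every head token that
    governs both an nsubj and an obj dependent (illustrative rule only)."""
    by_head = {}
    for tok in tokens:
        by_head.setdefault(tok.head, []).append(tok)

    triplets = []
    for tok in tokens:
        children = by_head.get(tok.idx, [])
        subjects = [c for c in children if c.deprel == "nsubj"]
        objects = [c for c in children if c.deprel == "obj"]
        for s in subjects:
            for o in objects:
                triplets.append((s.lemma, tok.lemma, o.lemma))
    return triplets


# Hand-written (assumed) parse of "लड़का सेब खाता है" ("The boy eats an apple").
sentence = [
    Token(1, "लड़का", "लड़का", 3, "nsubj"),
    Token(2, "सेब", "सेब", 3, "obj"),
    Token(3, "खाता", "खा", 0, "root"),
    Token(4, "है", "है", 3, "aux"),
]

triplets = extract_triplets(sentence)
print(triplets)
```

In a real pipeline the token list would come from a dependency parser rather than being written by hand, and additional rules would be needed for oblique arguments, postpositions, and passive constructions.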

Abstract

Visual-Semantic Hindi-Aligned Multimodal Knowledge Graph (VISHAM-KG) is a framework for constructing consistent multimodal knowledge graphs (KGs) from Hindi visual documents by systematically aligning visual and textual entities. The aim of this study is to integrate rule-based linguistic analysis with computer vision-based object detection to support structured semantic representation and grounded reasoning in low-resource Indic languages. The protocol begins with the preparation of Hindi visual documents for natural language processing (NLP), followed by optical character recognition (OCR) to extract Devanagari script and by linguistic preprocessing comprising tokenization, lemmatization, part-of-speech tagging, and dependency parsing. In parallel, visual entities are extracted from images using object detection and filtered using confidence thresholds. Textual and visual entities are embedded into a shared semantic space using the multilingual transformer model XLM-R for text and CLIP-ViT for images, and are aligned using cosine similarity thresholds. These aligned entities are combined with rule-based dependency relations to generate multimodal triplets. The protocol produces a structured multimodal knowledge graph encoded as subject-relation-object triplets with explicit visual grounding based on the Indian knowledge base. The resulting output supports cross-modal querying, entity alignment, and knowledge graph reasoning for Hindi visual documents and provides a replicable framework for multimodal knowledge construction in low-resource linguistic settings.
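The two thresholding steps described above, confidence filtering of detections and cosine-similarity alignment, can be sketched as follows. This is an illustration under stated assumptions rather than the published implementation: the helper names, the toy three-dimensional vectors, and the threshold values (0.5 and 0.8) are hypothetical, and the vectors are assumed to have already been projected into a shared space, since raw XLM-R and CLIP-ViT outputs do not share dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_detections(detections, min_conf):
    """Keep only detections whose confidence meets the threshold."""
    return [d for d in detections if d["conf"] >= min_conf]


def align(text_entities, visual_entities, min_sim):
    """Pair each text entity with its most similar visual entity,
    keeping the pair only if the similarity meets the threshold."""
    pairs = []
    for name, t_vec in text_entities.items():
        best_label, best_sim = None, -1.0
        for v in visual_entities:
            sim = cosine(t_vec, v["vec"])
            if sim > best_sim:
                best_label, best_sim = v["label"], sim
        if best_label is not None and best_sim >= min_sim:
            pairs.append((name, best_label, round(best_sim, 3)))
    return pairs


# Toy vectors standing in for projected text and image embeddings (assumed).
text_entities = {"सेब": [0.9, 0.1, 0.0], "लड़का": [0.1, 0.9, 0.1]}
detections = [
    {"label": "apple", "conf": 0.91, "vec": [1.0, 0.0, 0.0]},
    {"label": "person", "conf": 0.84, "vec": [0.0, 1.0, 0.0]},
    {"label": "kite", "conf": 0.20, "vec": [0.0, 0.0, 1.0]},
]

kept = filter_detections(detections, min_conf=0.5)  # placeholder threshold
aligned = align(text_entities, kept, min_sim=0.8)   # placeholder threshold
print(aligned)
```

Filtering before alignment means a low-confidence detection can never be matched to a text entity, which is one way detection errors are kept from reaching the graph. Suitable threshold values depend on the data and are not taken from the article.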

Introduction

Knowledge graphs (KGs) are structured semantic representations in which entities are modelled as nodes and relationships as edges. They enable efficient knowledge retrieval and contextual reasoning across applications such as question answering, recommendation systems, and information extraction [1]. Over the past decade, KG construction methodologies have advanced substantially. However, most existing approaches are designed for resource-rich languages and rely predominantly on large-scale textual corpora [2]. As a result, low-resource languages remain underrepresented, constraining the applicability....

Protocol

No ethical approval is required for this protocol, as it exclusively uses publicly available, non-human, non-sensitive visual and textual data. Table 2 lists all tools and techniques together with their dependencies. All source code, configuration files, and scripts required to reproduce the multimodal knowledge graph construction pipeline are available in a public GitHub repository (preeti017phdit22-wq/VISHAM_KG). The repository includes installation instructions and dependency specifications to facilitate reproducibility.

Mod....

Results

VISHAM-KG is evaluated on two tasks commonly used in knowledge representation benchmarks: similarity score computation and link prediction.

Experimental setup

Evaluate the constructed multimodal knowledge graph using two established tasks: (i) cross-modal similarity assessment and (ii) knowledge graph link prediction. Perform all evaluations exclusively on the finalized graph output generated at the endpoint of the protocol. .......
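Link prediction is commonly scored with rank-based metrics such as mean reciprocal rank (MRR) and Hits@k. Whether the article reports exactly these metrics is not visible in this excerpt, so the sketch below is only an assumed illustration of how such scores are computed. The function names and the example ranks are hypothetical.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test triplets (ranks are 1-based)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of test triplets whose correct entity is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks of the correct entity for five held-out triplets.
ranks = [1, 3, 2, 10, 1]

mrr = mean_reciprocal_rank(ranks)
h1 = hits_at_k(ranks, 1)
h3 = hits_at_k(ranks, 3)
print(f"MRR={mrr:.3f}  Hits@1={h1:.2f}  Hits@3={h3:.2f}")
```

MRR rewards placing the correct entity near the top of the ranking, while Hits@k only checks whether it falls within the first k candidates, so the two are usually reported together.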

Discussion

The performance of the VISHAM-KG framework depends primarily on three critical components: OCR for Devanagari text (step 1.2), confidence-based visual object detection using CLIP-ViT (step 1.3), and embedding-based cross-modal alignment (step 1.4). OCR accuracy directly influences downstream linguistic parsing and entity extraction: errors introduced at this stage propagate to relation identification and reduce alignment precision. This effect is mitigated through Hindi-specific normalization, lemmatization, and .......

Disclosures

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Materials

List of materials used in this article
| Name | Catalog Number | Company | Comments |
|---|---|---|---|
| BiLSTM-CRF and Indic NER Model | Custom-trained | PyTorch | Named entity recognition |
| CLIP-ViT-B/32 | 2022-09 | OpenAI | Visual embedding generation |
| CPU | Intel i9 | Intel | General computation |
| EasyOCR | v1.7.1 | Jaided AI | Hindi text extraction from images |
| GPU | NVIDIA RTX 3090 | NVIDIA | Model inference acceleration |
| Hindi Kids Stories | 10 stories | Curated dataset | Evaluation corpus |
| Neo4j | v5.13 | Neo4j Inc. | Knowledge graph storage |
| NumPy | v1.24 | NumPy Community | Numerical computations |
| Pandas | v2.0 | Pandas Community | Data handling |
| Python | v3.10 | Python Software Foundation | Pipeline implementation |
| PyTorch | v2.0 | Meta AI | Deep learning framework |
| Stanza (Hindi Model) | v1.6.1 | Stanford NLP | POS tagging and dependency parsing |
| XLM-R (Base) | 2023-05 | HuggingFace | Text embedding generation |
| YOLOv8 | v8.0.208 | Ultralytics | Visual object detection |

References

  1. Alberts, A., et al. VisualSem: A high-quality knowledge graph for vision and language. arXiv. (2020).
  2. Chen, Y., et al. A survey on multimodal knowledge graphs: Construction, completion and applications. Mathematics. 11 (8), 1815-1835 (2023....

Tags

Multimodal Knowledge Graphs, Rule Based Linguistic Analysis, Computer Vision, Visual Entity Extraction, Hindi Visual Documents, Optical Character Recognition, Dependency Parsing, Entity Alignment, Multilingual Transformer, Knowledge Graph Reasoning