Visual-Semantic Hindi-Aligned Multimodal Knowledge Graph (VISHAM-KG) is a framework for constructing consistent multimodal knowledge graphs (KGs) from Hindi visual documents by systematically aligning visual and textual entities. The aim of this study is to integrate rule-based linguistic analysis with computer vision-based object detection to support structured semantic representation and grounded reasoning in low-resource Indic languages. The proposed algorithm begins with the preparation of Hindi visual documents for natural language processing (NLP), followed by optical character recognition (OCR) to extract Devanagari text and by linguistic preprocessing, which includes tokenization, lemmatization, part-of-speech tagging, and dependency parsing. In parallel, visual entities are extracted from images using object detection and filtered by a confidence threshold. Textual and visual entities are embedded into a shared semantic space using the multilingual transformer model XLM-R together with CLIP-ViT, and are aligned when their cosine similarity exceeds a threshold. These aligned entities are combined with rule-based dependency relations to generate multimodal triplets. The protocol produces a structured multimodal knowledge graph encoded as subject-relation-object triplets with explicit visual grounding based on the Indian knowledge base. The resulting output supports cross-modal querying, entity alignment, and knowledge graph reasoning for Hindi visual documents and provides a replicable framework for multimodal knowledge construction in low-resource linguistic settings.
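The filtering, alignment, and triplet-generation steps described above can be illustrated with a minimal sketch. It is not the VISHAM-KG implementation: the function names, the thresholds (0.5 for detection confidence, 0.8 for cosine similarity), and the three-dimensional toy vectors are all assumptions made for illustration. In practice the vectors would come from XLM-R and CLIP-ViT after projection into a shared space, and the triple would come from the dependency parser.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_detections(detections, min_conf=0.5):
    """Keep detections at or above the confidence threshold (assumed value)."""
    return [d for d in detections if d["conf"] >= min_conf]


def align(text_entities, visual_entities, min_sim=0.8):
    """Map each textual entity to its most similar visual entity,
    provided the similarity reaches min_sim (assumed value)."""
    alignment = {}
    for name, vec in text_entities.items():
        best_label, best_sim = None, min_sim
        for v in visual_entities:
            sim = cosine(vec, v["vec"])
            if sim >= best_sim:
                best_label, best_sim = v["label"], sim
        if best_label is not None:
            alignment[name] = best_label
    return alignment


def ground_triplets(triples, alignment):
    """Attach visual grounding to subject-relation-object triples."""
    return [
        {
            "subject": s,
            "relation": r,
            "object": o,
            "grounding": {e: alignment[e] for e in (s, o) if e in alignment},
        }
        for s, r, o in triples
    ]


# Toy stand-ins for embeddings already projected into a shared space.
text_entities = {
    "लड़का": [0.9, 0.1, 0.0],  # "boy"
    "गेंद": [0.1, 0.9, 0.1],   # "ball"
    "खुशी": [0.0, 0.1, 0.9],   # "happiness" (no visual counterpart)
}
detections = [
    {"label": "person", "conf": 0.92, "vec": [0.8, 0.2, 0.1]},
    {"label": "ball", "conf": 0.81, "vec": [0.2, 0.95, 0.0]},
    {"label": "dog", "conf": 0.30, "vec": [0.1, 0.1, 0.9]},  # dropped: low confidence
]
# Hypothetical triple, as a rule-based dependency step might emit it.
triples = [("लड़का", "खेलता_है", "गेंद")]

visual = filter_detections(detections)
alignment = align(text_entities, visual)
kg = ground_triplets(triples, alignment)
```

Here the low-confidence detection is discarded before alignment, so the abstract noun "खुशी" stays ungrounded instead of being matched to a spurious object. The two thresholds trade coverage against precision and would need tuning on real data.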