Image Caption Generation using Deep Learning Approaches

Arun Pratap Singh; Manish Manoria; Sunil Joshi

doi:10.3791/71528

Research Article

June 12th, 2026

Arun Pratap Singh¹, Manish Manoria², Sunil Joshi¹

¹Samrat Ashok Technological Institute, ²Rungta Group of Institutes (R1)

Summary

This protocol uses CNNs, RNNs, and ResNets for image captioning, generating descriptions of the activities, people, objects, and other elements in an image. The approach is evaluated with BLEU, CIDEr, METEOR, and ROUGE scores.

Abstract

Image caption generation aims to produce a meaningful textual description of an image. The extracted information relates to the activities present in the image. ResNet (Residual Network) is well known for image classification because it learns deep hierarchical representations. This paper uses ResNet with various smart filters to analyze images more deeply, enabling genuine and meaningful descriptions that closely match the reference captions. The work uses a smart filtering technique to enhance images, a CNN to encode features, and, after model training, an RNN (Recurrent Neural Network) to decode the features. ResNet is very effective for computer vision tasks, especially object classification and semantic analysis. It is known for its residual connections, also called skip connections, which address the vanishing gradient problem, a crucial problem in deep learning. The model is trained on the MSCOCO (Microsoft Common Objects in Context) benchmark, a large dataset with reference annotations that is useful for various computer vision tasks. ResNet helps enhance generalization capability, which is particularly useful for diverse images. The results obtained are BLEU scores of B1: 0.579, B2: 0.404, B3: 0.279, and B4: 0.191; METEOR: 0.195; ROUGE: 0.396; and CIDEr: 0.6.
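
The B1 to B4 values above are cumulative BLEU scores, which combine clipped n-gram precision with a brevity penalty. As a minimal illustration of how such a score is formed, the following Python sketch computes sentence-level BLEU with uniform weights and no smoothing. It is not the evaluation code used in this work: the function names and the example captions are hypothetical, and benchmark results are normally computed at corpus level with a standard toolkit, so the numbers it prints are not comparable to those reported here.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: a candidate n-gram is credited at most as
    many times as it occurs in the reference that contains it most often."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(candidate, references):
    """Penalize candidates shorter than the closest-length reference."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Closest reference length; ties are resolved toward the shorter reference.
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    return 1.0 if c >= r else math.exp(1.0 - r / c)


def bleu(candidate, references, max_n=4):
    """Cumulative sentence-level BLEU with uniform weights and no smoothing."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # the geometric mean collapses when any precision is zero
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)


# Hypothetical captions, used only to exercise the functions.
candidate = "a man rides a horse on the beach".split()
references = ["a man is riding a horse on the beach".split()]
for n in range(1, 5):
    print(f"B{n}: {bleu(candidate, references, max_n=n):.3f}")
```

Because each additional n-gram order can only lower the geometric mean for an imperfect caption, the scores in this sketch fall from B1 to B4, the same pattern seen in the reported results.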

Introduction

Image captioning is a crucial task at the intersection of computer vision and natural language processing: it produces a description of an image and the actions it depicts. The model must comprehend the image and translate that information into meaningful sentences, or captions¹. The procedure consists of two main phases. The first is feature extraction, for which a CNN model is used; the second is image description, for which an RNN is used. Between the two, ResNet is used for semantic analysis, sequence generation, and an attention mechanism. ResNet differs from template-based methods and DenseNet-based modules because it uses skip conn....

Protocol

The implementation uses ResNet-152 as the main model, with a CNN as the encoder, an RNN as the decoder, and the resources listed in the Table of Materials.
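
The encoder-decoder arrangement described above can be sketched in PyTorch as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the class names, layer sizes, and the choice of an LSTM as the recurrent decoder are all hypothetical, and a small two-block residual encoder stands in for ResNet-152 so that the example runs quickly on a CPU. In practice the encoder could be replaced by a pretrained backbone such as `torchvision.models.resnet152` with its classification layer removed.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Two 3x3 convolutions whose output is added back to the input."""

    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(self.body(x) + x)  # skip connection: F(x) + x


class EncoderCNN(nn.Module):
    """Small stand-in for a ResNet-152 backbone (assumption):
    maps an image to a fixed-size feature vector."""

    def __init__(self, embed_size, channels=32):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, kernel_size=7, stride=2, padding=3)
        self.blocks = nn.Sequential(ResidualBlock(channels), ResidualBlock(channels))
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(channels, embed_size)

    def forward(self, images):
        x = self.blocks(torch.relu(self.stem(images)))
        return self.fc(self.pool(x).flatten(1))  # (batch, embed_size)


class DecoderRNN(nn.Module):
    """LSTM decoder (assumption): the image feature is the first input step,
    followed by the embedded caption tokens."""

    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Teacher forcing: drop the last token so that, with the image step
        # prepended, the input length equals the caption length.
        tokens = self.embed(captions[:, :-1])
        inputs = torch.cat([features.unsqueeze(1), tokens], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)  # (batch, caption_len, vocab_size)


# Usage with random data; all sizes are hypothetical.
vocab_size = 100
encoder = EncoderCNN(embed_size=64)
decoder = DecoderRNN(embed_size=64, hidden_size=128, vocab_size=vocab_size)
images = torch.randn(2, 3, 64, 64)
captions = torch.randint(0, vocab_size, (2, 12))
logits = decoder(encoder(images), captions)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), captions.reshape(-1))
```

Feeding the image feature as the first time step means that output step t is trained to predict caption token t, which is why the cross-entropy loss can compare the logits directly with the full caption.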

ResNet-152
ResNet serves as the backbone for efficient feature extraction in image captioning. It provided better training performance than other models because it addresses the vanishing gradient problem. Several objects may appear in an image, and the model needs to understand the relationships among them to caption well, which is why the process can be regarded as hierarchical feature extraction. ResNet-....
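
The way a skip connection counters the vanishing gradient can be written in one line. The notation below is generic and not taken from this article: x is the input to a residual block, F is the transformation learned by the block, y is the block output, L is the training loss, and I is the identity.

```latex
y = \mathcal{F}(x) + x
\qquad\Longrightarrow\qquad
\frac{\partial \mathcal{L}}{\partial x}
  = \frac{\partial \mathcal{L}}{\partial y}
    \left( \frac{\partial \mathcal{F}}{\partial x} + I \right)
```

The identity term gives the gradient a direct path back to earlier layers, so the signal is preserved even when the derivative of F is small. Without the skip connection, only the first term remains, and repeated multiplication of small factors across many layers shrinks the gradient.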

Results

Software and environment specifications
Python 3.10 was the main programming language used for the experiments, and Visual Studio Code (VS Code) was used as the development environment. The main libraries were pickle for data serialization, multiprocessing for parallel processing, glob for file handling, and PyTorch for deep learning model development. The hardware configuration included 256 GB of storage, 8 GB of RAM, and an NVIDIA GTX series GPU with CUDA support for f.......
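
How the three standard-library modules named above might fit together is sketched below: glob lists the image files, a worker pool processes them in parallel, and pickle caches the results. The article does not describe its actual pipeline, so every function name, the file pattern, and the per-file work are hypothetical. The sketch also uses the thread-backed `ThreadPool` from the multiprocessing package so that it runs without a `__main__` guard; a process pool (`multiprocessing.Pool`) would be the closer match for CPU-bound work.

```python
import glob
import os
import pickle
from multiprocessing.pool import ThreadPool


def describe_file(path):
    """Placeholder per-file work (hypothetical); a real pipeline would
    load the image and extract its features here."""
    return os.path.basename(path), os.path.getsize(path)


def build_cache(image_dir, cache_path, pattern="*.jpg", workers=4):
    """Glob matching files, process them in parallel, and pickle the results."""
    paths = sorted(glob.glob(os.path.join(image_dir, pattern)))
    with ThreadPool(workers) as pool:
        records = dict(pool.map(describe_file, paths))
    with open(cache_path, "wb") as handle:
        pickle.dump(records, handle)
    return records


def load_cache(cache_path):
    """Read back a cache written by build_cache."""
    with open(cache_path, "rb") as handle:
        return pickle.load(handle)
```

Caching the per-image results once means later training runs can call `load_cache` instead of reprocessing every file, which matters on a machine with limited RAM.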

Discussion

Image captioning is a difficult task in artificial intelligence. It has been the subject of numerous studies, yet precise captioning remains hard to achieve. Many machine learning techniques can be applied to it, and numerous studies have used CNN, RNN, and ResNet-152; however, higher precision and shorter processing time are still needed. The proposed system is built using CNN as the encoder, RNN as the decoder, TorchVision as.......

Disclosures

The authors declare that they have no competing financial interests or personal relationships that could have influenced the work reported in this paper.

Acknowledgements

We acknowledge the creators of the MSCOCO dataset for providing the benchmark used in this study. The authors declare that no external funding was received for this study.

Materials

List of materials used in this article
Name	Company	Catalog Number	Comments
AMD Ryzen 5000 series	AMD	100-100000059WOF	AMD Ryzen 5000 Series is a line of high-performance processors developed by AMD, based on the Zen 3 architecture. These processors are widely used in desktops and laptops for both general-purpose computing and demanding tasks such as data processing and machine learning workflows.
GPU	NVIDIA	4.71933E+12	The NVIDIA GeForce GTX is a series of graphics processing units (GPUs) developed by NVIDIA, widely used for gaming as well as general-purpose computing tasks like deep learning and image processing.
Intel Core i5	Intel	BX8071514400F	Intel Core i5 is a mid-range processor series developed by Intel, widely used in personal computers for both general-purpose and computational tasks.
Python 3.10	Python Software Foundation	PEP 619	Python is a high-level, interpreted programming language widely used in scientific computing, data analysis, and machine learning. It is known for its simplicity, readability, and extensive ecosystem of libraries.
PyTorch	Facebook	26.03-py3	PyTorch is an open-source deep learning framework developed by Meta Platforms (formerly Facebook), widely used for building and training neural networks in research and industry.
Visual Studio Code	Microsoft	None	Visual Studio Code (VS Code) is a lightweight, open-source code editor developed by Microsoft. It is widely used for software development, including machine learning and deep learning projects.
Windows 11	Microsoft	KB5083631	Windows 11 is an operating system developed by Microsoft, widely used for general computing as well as software development and machine learning tasks.
