Research Article

Image Caption Generation using Deep Learning Approaches

June 12th, 2026


Summary


This protocol uses CNNs, RNNs, and ResNets for image captioning, generating descriptions of the activities, people, objects, and other elements in an image. The approach is evaluated with BLEU, CIDEr, METEOR, and ROUGE scores.

Abstract


Image caption generation aims to produce a meaningful textual description of an image, with the extracted information reflecting the activities present in it. ResNet (Residual Network) is well known for image classification because it learns deep hierarchical representations. This paper uses ResNet together with smart filters to analyze images in greater depth, enabling the generation of genuine, meaningful descriptions that closely match the reference captions. The work applies a smart filtering technique to enhance images, a CNN to encode features, model training, and then an RNN (Recurrent Neural Network) to decode the features. ResNet is a very effective model for computer vision tasks, especially object classification and semantic analysis. Its residual connections, also known as skip connections, mitigate the vanishing gradient problem, a crucial problem in deep learning. The model is trained on the MSCOCO (Microsoft Common Objects in Context) benchmark, a large dataset with reference annotations that is useful for various computer vision tasks. ResNet also helps improve generalization, which is particularly useful for diverse images. The results obtained are BLEU scores of B1: 0.579, B2: 0.404, B3: 0.279, and B4: 0.191; METEOR: 0.195; ROUGE: 0.396; and CIDEr: 0.6.

Introduction


In computer vision and natural language processing, image captioning is a crucial task that produces a description of an image and the actions it depicts. The model aims to comprehend images and translate that information into meaningful sentences or captions [1]. The procedure consists of two main phases: feature extraction, for which a CNN model is used, and image description, for which an RNN is used. In between, ResNet is used for semantic analysis, sequence generation, and an attention mechanism. ResNet differs from template-based methods and DenseNet-based modules in that it uses skip connections, which reduce execution time while improving performance. Image captioning has numerous applications, including assisting visually impaired people, enhancing social media platforms, optimizing image-based search engines, supporting image-based AI (artificial intelligence), and many more [2].

Scene recognition is another fundamental problem in computer vision: it is the process of identifying and classifying an image's general context or environment, such as a beach, cityscape, forest, or office. Unlike object recognition, which focuses on individual items, scene recognition considers textures, spatial arrangements, and object relationships to understand the larger context. It uses CNNs and Vision Transformers, deep learning models trained on large datasets such as Places365 and ImageNet. Applications include security surveillance, augmented and virtual reality (AR and VR) for immersive experiences, robotics for environmental awareness, and autonomous vehicles for navigation. Despite advances, problems such as shifting viewpoints, occlusions, and changing lighting keep scene recognition an active topic in computer vision and artificial intelligence research.

EnsCaption, a dual generative adversarial network model, was proposed to improve a generation-retrieval ensemble technique [3]. Its generation-based component produces captions aligned with the target objectives, while its retrieval-based component uses a ranking-based model to select the candidate that describes the query image most precisely. A mapping of images to a “meaning space” was introduced using visual components such as objects, activities, and scenes, which were then aligned with corresponding verbal templates [4]. Using the correlations and attributes found in the images, the approach constructs phrases, and the resulting sentences express information in a rich, condensed, and subtle way. Template-based caption generation was enhanced by incorporating commonsense knowledge to improve semantic understanding [5]. This technique extended the template's reach beyond direct image characteristics to encompass inferred associations. That work uses an existing object detection dataset to extract 16,000 commonsense statements for each annotated category. Additionally, generalization was achieved using WordNet, enabling the induction of a large number of facts about previously unseen objects [6]. Another work offers a review built around an organized taxonomy of deep learning techniques for image captioning, covering topics such as attention mechanisms, reinforcement learning strategies, and encoder-decoder frameworks. Along with addressing issues such as object hallucination and contextual comprehension, it examines commonly used datasets and evaluation criteria. Its authors point out areas for further study, such as improving vision-language pre-training techniques and reducing dataset bias. A semantic analysis approach based on convolutional neural networks and recurrent neural networks was explored for image captioning tasks [7].

Image captioning is one of the best-known applications, allowing computers to produce descriptive phrases that summarize an image. To provide high-level, meaningful semantic descriptions, the procedure entails more than identifying objects and scenes; it also involves examining their states, characteristics, and interactions. Despite the inherent complexity of image captioning, researchers have made impressive strides in the area. The three main deep neural network-based image captioning techniques covered in that study are CNN-RNN-based, CNN-CNN-based, and reinforcement learning frameworks. An end-to-end trainable model for image captioning was introduced, integrating computer vision and natural language processing to generate coherent descriptions of images [8]. It uses an encoder-decoder framework in which a pre-trained CNN encodes an image into a feature vector and an LSTM then decodes it into a sequence of words. Notwithstanding its drawbacks, including difficulties with intricate scenes, that paper's contribution to vision-language tasks remains fundamental [9].

ResNet is the convolutional neural network (CNN) used in the proposed image captioning model to extract rich visual information from input images. It serves as the encoder in an encoder-decoder architecture, producing a feature vector that represents the image. The decoder, which generates descriptive captions word by word, receives these features and is often implemented as a recurrent neural network (RNN), such as an LSTM or GRU. An attention mechanism can be added to improve performance by enabling the decoder to focus on specific regions of the image as it generates each word. To maximize caption accuracy, the model is trained end-to-end using a loss function such as cross-entropy and a dataset such as COCO. Transfer learning and ResNet fine-tuning can enhance feature extraction, further strengthening the model and enabling it to produce high-quality, contextually appropriate captions across a wide range of images. In image captioning, ResNet is often preferred over other models because it effectively addresses the vanishing gradient problem, a common issue in deep neural networks. This is made possible by its residual learning approach, which uses skip connections to ease gradient flow during backpropagation and so allows considerably deeper networks to be trained without sacrificing performance. The multilayer perceptron, a fully connected feed-forward neural network, is associated with the trainable layer. The RNN then decodes captions using the softmax layer, producing candidate captions. As illustrated in Figure 1, the residual function is f(x), the block output is f(x) + x, and x is carried forward unchanged as the identity. The system uses residual blocks to calibrate the model during training, and its inputs pass through both weighted connections and skip connections, also referred to as identity shortcuts.

Figure 1: Residual connection network. This figure illustrates the architecture of a residual network, highlighting skip connections that improve gradient flow and mitigate vanishing gradients during deep network training.
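
As a minimal illustration of the skip connection in Figure 1, the following Python sketch computes ReLU(f(x) + x) on plain lists. The function `halve` is a hypothetical stand-in for the learned layers f(x) and is not part of the proposed model; a real residual block would use convolutional layers with learned weights.

```python
from typing import Callable, List

Vector = List[float]


def relu(v: Vector) -> Vector:
    """Element-wise ReLU: negative entries become 0."""
    return [max(0.0, a) for a in v]


def residual_block(x: Vector, f: Callable[[Vector], Vector]) -> Vector:
    """Return ReLU(f(x) + x): the residual branch plus the identity shortcut."""
    fx = f(x)
    if len(fx) != len(x):
        raise ValueError("f(x) must have the same length as x for the shortcut")
    return relu([a + b for a, b in zip(fx, x)])


def halve(v: Vector) -> Vector:
    """Hypothetical residual function, used only for illustration."""
    return [0.5 * a for a in v]


out = residual_block([1.0, -2.0, 4.0], halve)
print(out)
```

Because x is added back unchanged, the block can pass its input through even when f(x) contributes little, which is the property the figure attributes to skip connections.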

Let \(P_l\) denote the output of the \(l\)-th residual block, and let \(b_l\) be a random variable that switches the block on or off. If \(b_l\) is equal to 1, the block behaves as a standard residual block with ReLU activation; if \(b_l\) is not equal to 1, the block reduces to the identity mapping:

\(P_l = \mathrm{ReLU}(\mathrm{id}(P_{l-1}))\)   (1)

In the general case, with \(b\) as the random variable and \(k\) as the mapping function:

\(P_l = \mathrm{ReLU}(b_l \, k_l(P_{l-1}) + \mathrm{id}(P_{l-1}))\)   (2)

Here \(S_l\) is the survival probability of block \(l\) in the proposed system:

[Equation not recoverable from the source: a ReLU-based layer update involving \(S_l\).]   (3)

The resulting rule for the survival probability is:

\(S_l = 1 - \frac{l}{L}(1 - S_L)\)   (4)

where \(S_L\) is the survival probability of the last block and \(L\) is the total number of blocks.
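
The linear rule in Equation (4) can be checked numerically. The sketch below is an illustration only: the values L = 4 and S_L = 0.5 are assumed for the example and are not taken from the experiments, and `sample_gate` is a hypothetical helper that draws the on/off variable b_l.

```python
import random


def survival_probability(l: int, total_blocks: int, s_last: float) -> float:
    """Equation (4): S_l = 1 - (l / L) * (1 - S_L)."""
    return 1.0 - (l / total_blocks) * (1.0 - s_last)


def sample_gate(s_l: float, rng: random.Random) -> int:
    """Draw b_l: 1 (block kept) with probability s_l, otherwise 0 (identity only)."""
    return 1 if rng.random() < s_l else 0


# Assumed example values: 4 blocks, last block survives half of the time.
probs = [survival_probability(l, 4, 0.5) for l in range(1, 5)]
print(probs)

rng = random.Random(0)
gates = [sample_gate(p, rng) for p in probs]
print(gates)
```

Early blocks are kept almost always and later blocks are dropped more often, so the survival probability decreases linearly from near 1 down to S_L at the last block.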

Image captioning is a challenging task that combines natural language processing and computer vision to produce descriptive textual captions for images. It requires comprehending and interpreting an image's visual content and translating it into coherent sentences within its context. Extensive and diverse datasets are crucial in this field for model training and evaluation, since they offer a wide array of images and related annotations for developing and testing image captioning algorithms. The most frequently used datasets are MSCOCO and Flickr30k, which contain large numbers of captioned images and pose various image processing challenges. MSCOCO is much larger than Flickr30k [11]. The MSCOCO dataset is split into 82,783 images for training, 40,504 for validation, and 40,775 for testing.

Protocol


The implementation uses ResNet-152 as the main model, with a CNN as the encoder, an RNN as the decoder, and the resources listed in the Table of Materials.

ResNet-152
ResNet serves as the backbone for efficient feature extraction in image captioning. It trains better than many other models because it addresses the vanishing gradient problem. Images may contain various objects, and the model needs to understand their relationships to caption them well, which is why ResNet can be regarded as a hierarchical feature extractor. ResNet-152 can handle complex computer vision tasks. Its key advantage is the effective use of residual (skip) connections, which mitigate the vanishing gradient issue, accelerate training, and make it more stable, allowing the network to learn complex, robust features and achieve higher accuracy. ResNet-152 follows a bottleneck design that reduces computational cost and makes it more effective than architectures such as VGG-16. It is also a prominent transfer learning backbone, suitable for pre-trained models and for varied tasks such as object detection and segmentation. ResNet differs considerably from transformer-based models, which use a self-attention mechanism to model sequential data; such models require large amounts of data and yield effective results but run somewhat slower. The motivation for selecting ResNet is its skip connections, which speed up execution while significantly improving results. In image captioning, ResNet extracts the features that represent the objects and actions in the image. For an input Z, the residual block is computed as:

\(K = F(Z, \{x_i\}) + Z\)   (5)

where \(Z\) is the input of the residual block and \(F(Z, \{x_i\})\) is the residual function, which involves batch normalization, convolutional layers, and ReLU activation. \(\{x_i\}\) denotes the learned weights of the corresponding layers. The term \(Z\) also defines the identity skip connection, which mitigates the vanishing gradient problem. ResNet is generally used as a feature extractor that maps images to visual features. Here, \(I\) is the input image, which is mapped to a high-level visual feature representation \(V\):

\(V = \mathrm{ResNet}(I)\)   (6)

Before feature extraction, the image must be preprocessed. The raw image \(I_r\) is taken from the MSCOCO benchmark, so the first preprocessing step is to resize and normalize it.

\(I_r \in \mathbb{R}^{H \times W \times C}\)   (7)

\(I_{resize} = \mathrm{Resize}(I_r, (H', W'))\)   (8)

where \(H'\) and \(W'\) are the target height and width of the image, and \(I_{resize}\) is the resized image.

The pixel values are then normalized, for example to the range [-1, 1] or [0, 1]:

\(I_{normalization} = \frac{I_{resize} - \mu}{\sigma}\)   (9)

where \(\mu\) is the mean pixel value and \(\sigma\) is the standard deviation of the pixel values of the reference image. The normalized image is then passed on for feature extraction.

\(f_e = \mathrm{FeatureExtractor}(I_{normalization})\)   (10)

where \(f_e \in \mathbb{R}^{d}\) is the feature vector. When the raw caption is tokenized, it is converted into numeric form.

\(T_c = \{w_1, w_2, w_3, \ldots, w_T\}\)   (11)

If the caption \(C\) is split into words, then

\(C_{token} = \mathrm{Tokenize}(C)\)   (12)

Here, the vocabulary plays an important role, with every word uniquely identified by an integer index.

\(w_n \rightarrow V_c(w_n) \in \mathbb{Z}\)   (13)

where \(V_c\) is the vocabulary function. All sequences must have equal length, so each sequence is padded to the maximum length \(L_{max}\):

[Equation not recoverable from the source: a piecewise definition of \(C_{padding}\) that pads \(C_{token}\) to length \(L_{max}\).]   (14)
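
The tokenization, vocabulary indexing, and padding steps of Equations (11) to (14) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the special-token strings (`<pad>`, `<start>`, `<end>`, `<unk>`), whitespace tokenization, and truncation of over-long captions are all choices made for the example.

```python
from typing import Dict, List

# Assumed special tokens; the source does not give their exact spelling.
PAD, START, END, UNK = "<pad>", "<start>", "<end>", "<unk>"


def tokenize(caption: str) -> List[str]:
    """Split a raw caption into lowercase words wrapped in start/end markers."""
    return [START] + caption.lower().split() + [END]


def build_vocab(captions: List[str]) -> Dict[str, int]:
    """Assign each distinct word a unique integer index (the role of V_c)."""
    vocab = {PAD: 0, START: 1, END: 2, UNK: 3}
    for caption in captions:
        for word in caption.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(caption: str, vocab: Dict[str, int], l_max: int) -> List[int]:
    """Map words to indices, then pad (or truncate) to exactly l_max entries."""
    ids = [vocab.get(word, vocab[UNK]) for word in tokenize(caption)]
    ids = ids[:l_max]  # assumption: over-long captions are cut off
    return ids + [vocab[PAD]] * (l_max - len(ids))


captions = ["a dog runs", "a cat sits"]
vocab = build_vocab(captions)
encoded = encode("a dog sits", vocab, l_max=8)
print(encoded)
```

Every encoded caption has the same length, which is what allows captions to be stacked into batches for training.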

Each token is then embedded as:

\(e_j = \mathrm{Embedding}(V_c(w_j))\), for \(j = 1, 2, 3, \ldots, L_{max}\)   (15)

where \(e_j \in \mathbb{R}^{k}\) is an embedded vector with \(k\) dimensions. The decoder is then used to generate the candidate caption, based on a probabilistic model.

\(P(w_j \mid w_{1:j-1}, e_{j-1})\)   (16)

where \(w_j\) is the word at time step \(j\), \(w_{1:j-1}\) are the words generated up to time step \(j-1\), and \(e_{j-1}\) is the embedding of the previous word \(w_{j-1}\). At every time step, the network predicts the next word by computing a probability distribution over the vocabulary.

\(P(w_j) = \mathrm{softmax}(W_{output} h_j + b_{output})\)   (17)

where \(W_{output}\) is the output weight, \(b_{output}\) is the output bias, and \(h_j\) is the decoder hidden state. The word with the maximum probability is selected as

\(w_j = \arg\max P(w_j)\)   (18)
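
To make the softmax and arg max steps concrete, the sketch below turns a vector of scores into a probability distribution and picks the most probable word. The three-word vocabulary and the score values are hypothetical; in the model the scores would come from the output layer applied to the decoder state.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_word(logits: List[float], vocabulary: List[str]) -> Tuple[str, float]:
    """Return the arg max word and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return vocabulary[best], probs[best]


# Hypothetical vocabulary and scores, for illustration only.
word, prob = greedy_word([2.0, 1.0, 0.1], ["dog", "cat", "bird"])
print(word, round(prob, 3))
```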

Generation of the candidate caption stops once the maximum length is reached or a special token, such as the end marker, is produced. Beam search is also useful for selecting a better candidate caption, so the sequence is:

\(C = \{W_1, W_2, W_3, \ldots, W_T\}\)   (19)

\(\mathrm{Score}(C) = \sum_{j=1}^{T} \log P(W_j \mid W_{1:j-1})\)   (20)

The generated candidate caption is therefore the sequence \(C = \{W_1, W_2, \ldots, W_T\}\).
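
The scoring rule in Equation (20) is what beam search maximizes. The following sketch is a simplified beam search over a toy next-word model; `toy_model` and its probability table are hypothetical and exist only to show how a wider beam can find a higher-scoring caption than greedy decoding. The `<end>` token string is likewise an assumption.

```python
import math
from typing import Callable, Dict, List, Tuple

Sequence = Tuple[str, ...]


def beam_search(next_probs: Callable[[Sequence], Dict[str, float]],
                beam_size: int, max_len: int,
                end_token: str = "<end>") -> Tuple[Sequence, float]:
    """Keep the beam_size best partial captions, scored by summed log-probability."""
    beams: List[Tuple[Sequence, float]] = [((), 0.0)]
    finished: List[Tuple[Sequence, float]] = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for word, p in next_probs(seq).items():
                if p <= 0.0:
                    continue
                cand = (seq + (word,), score + math.log(p))
                # Captions ending in the end token leave the beam.
                (finished if word == end_token else candidates).append(cand)
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return max(finished or beams, key=lambda c: c[1])


def toy_model(prefix: Sequence) -> Dict[str, float]:
    """Hypothetical next-word distribution, for illustration only."""
    table = {
        (): {"a": 0.6, "the": 0.4},
        ("a",): {"dog": 0.5, "cat": 0.5},
        ("the",): {"dog": 0.9, "cat": 0.1},
    }
    return table.get(prefix, {"<end>": 1.0})


greedy_seq, greedy_score = beam_search(toy_model, beam_size=1, max_len=5)
best_seq, best_score = beam_search(toy_model, beam_size=2, max_len=5)
print(greedy_seq, best_seq)
```

With a beam of 1 the search commits to the locally best first word, while a beam of 2 keeps the alternative alive and ends with the caption of higher total probability.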

Long Short-Term Memory (LSTM) is generally used in sequence generation. The LSTM takes the features extracted by the CNN and generates words sequentially to create meaningful sentences. At each time step \(t\), the LSTM computes the forget gate:

\(f_t = \sigma(w_f[h_{t-1}, y_t] + b_f)\)   (21)

where \(f_t\) is the forget gate, \(\sigma\) is the activation function, \(w_f\) is the weight, \(b_f\) is the bias, \(y_t\) is the input feature vector, and \(h_{t-1}\) is the previous hidden state.

\(i_t = \sigma(w_j[h_{t-1}, y_t] + b_j)\)   (22)

\(\tilde{C}_t = \tanh(w_c[h_{t-1}, y_t] + b_c)\)   (23)

where \(i_t\) is the input gate, \(\tilde{C}_t\) is the candidate state, \(w_j\) and \(w_c\) are the weights for the input gate and candidate state, respectively, and \(b_j\) and \(b_c\) are the corresponding biases.

\(C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\)   (24)

where \(C_t\) is the cell state and \(C_{t-1}\) is the previous cell state.

\(O_t = \sigma(w_o[h_{t-1}, y_t] + b_o)\)   (25)

where \(O_t\) is the output gate, \(w_o\) is its weight, and \(b_o\) is its bias. The hidden and cell states are initialized as follows:

\(h_i = w_h \times k + b_h\)   (26)

\(C_i = w_c \times k + b_c\)   (27)

where \(h_i\) and \(C_i\) are the initial hidden and cell states, respectively, \(w_h\) and \(w_c\) are the weights for the hidden and cell states, \(b_h\) and \(b_c\) are the biases, and \(k\) is the extracted feature vector. The caption sequence is evaluated with the following log-probability (cross-entropy) objective:

\(\mathrm{Loss} = -\sum_{t=1}^{T} \log P(w_t \mid w_{1:t-1}, k)\)   (28)

where \(T\) is the length of the generated caption.
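
The gate equations above can be combined into a single LSTM step. The sketch below is a plain-Python illustration, not the authors' implementation. Two points are assumptions: the hidden-state update h_t = O_t ⊙ tanh(C_t) follows the standard LSTM formulation, which the text does not write out, and the all-zero weights are chosen only so the result is easy to verify by hand.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def affine(w: Matrix, v: Vector, b: Vector) -> Vector:
    """Compute w·v + b for a weight matrix, input vector, and bias vector."""
    return [sum(wi * vi for wi, vi in zip(row, v)) + bi for row, bi in zip(w, b)]


def lstm_step(y_t: Vector, h_prev: Vector, c_prev: Vector,
              params: Dict[str, list]) -> Tuple[Vector, Vector]:
    """One LSTM time step following the forget, input, candidate, and output gates."""
    z = h_prev + y_t  # concatenation [h_{t-1}, y_t]
    f_t = [sigmoid(a) for a in affine(params["w_f"], z, params["b_f"])]
    i_t = [sigmoid(a) for a in affine(params["w_j"], z, params["b_j"])]
    c_hat = [math.tanh(a) for a in affine(params["w_c"], z, params["b_c"])]
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    o_t = [sigmoid(a) for a in affine(params["w_o"], z, params["b_o"])]
    # Assumed standard hidden-state update: h_t = O_t * tanh(C_t).
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


hidden, inputs = 2, 3
zeros_w = [[0.0] * (hidden + inputs) for _ in range(hidden)]
zeros_b = [0.0] * hidden
params = {name: zeros_w for name in ("w_f", "w_j", "w_c", "w_o")}
params.update({name: zeros_b for name in ("b_f", "b_j", "b_c", "b_o")})

h_t, c_t = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -2.0], params)
print(h_t, c_t)
```

With zero weights every gate equals 0.5 and the candidate state is 0, so the new cell state is simply half of the previous one, which makes the role of the forget gate visible.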

The resized (pre-processed) image has dimensions 254 × 254 × 3, and \(I\) denotes the input image.

\(X^1 = \mathrm{ReLU}(w * I + b)\)   (29)

where \(w\) and \(b\) are the weight and bias, respectively, \(I\) is the input, and ReLU is the activation function. This is the computation of the convolutional layer. The pooling layer is then computed as:

\(P_L = \max(X^1)\)   (30)

After the pooling layer, the fully connected layer is mapped as:

\(F = \mathrm{ReLU}(w_f P_L + b_f)\)   (31)

where \(w_f\) and \(b_f\) are the weight and bias of the fully connected layer, respectively.

\(V = \{V_1, V_2, V_3, \ldots, V_N\}\)   (32)

\(V_j \in \mathbb{R}^d\)   (33)

where \(N\) is the number of spatial regions and \(d\) is the dimension of each feature.

\(h_0 = w_h V + b_h\)   (34)

\(C_0 = w_c V + b_c\)   (35)

where \(w_h\) and \(b_h\) are the weight and bias of the hidden state, respectively, and \(w_c\) and \(b_c\) are the weight and bias of the cell state, respectively. The caption is generated as:

\((h_j, C_j) = \mathrm{LSTM}(y_j, h_{j-1}, C_{j-1})\)   (36)

Encoder and decoder
The proposed system encodes the data using a CNN, in the manner of machine translation. In this setting, the input and output are both sequences, but they may differ in length. The machine encodes and decodes one vector at a time: starting from an initial vector, it continues computing until the final conditional probability distribution is reached. One example is as follows:

\(P(k_t = j \mid k_{1:t-1}, I)\)   (37)

This is the probability distribution over the next word.

The system encodes the data as an image vector, which can later be decoded. fcn(I) denotes the image model used for image understanding.

\(S_1 = \sigma(K S_0 + L X_1 + b)\)   (38)

\(S_2 = \sigma(K S_1 + L X_2 + b)\)   (39)

\(S_n = \sigma(K S_{n-1} + L X_n + b)\)   (40)

\(S_1\) is the state that follows \(S_0\), and \(S_2\) is the state that follows \(S_1\); in other words, every state depends on the output of the previous one. The CNN converts images into vectors and passes them to the following layer, which traverses all of the vectors. After the RNN decodes the vectors into words, an attention mechanism is used to arrange the words sequentially into a meaningful sentence.
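
The recurrence in Equations (38) to (40) can be traced with a small Python sketch. To keep it readable, the states, inputs, and the parameters K, L, and b are scalars here, which is a simplification; in the model they would be vectors and matrices. The numeric values are assumed for illustration.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def rnn_states(inputs: List[float], k: float, l: float, b: float,
               s0: float = 0.0) -> List[float]:
    """Apply S_n = sigmoid(K * S_{n-1} + L * X_n + b) for each input in turn."""
    states = []
    s = s0
    for x in inputs:
        s = sigmoid(k * s + l * x + b)
        states.append(s)
    return states


# Assumed example parameters; each state feeds into the next one.
states = rnn_states([1.0, -1.0, 0.5], k=0.5, l=1.0, b=0.0)
print([round(s, 4) for s in states])
```

Each state is computed from the previous state and the current input, which is why the order of the inputs matters to the final state.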

\(S_0 = h_T\)   (41)

where \(T\) is the length of the input.

\(S_t = \mathrm{RNN}(S_{t-1}, e(\hat{y}_{t-1}))\)   (42)

[Equation not recoverable from the source: a softmax probability for sequence prediction.]   (43)

\(k_1, k_2, k_3, k_4, \ldots, k_{t-1}\) are the hidden decoding states.

Figure 2: Encoding and decoding model. This figure presents the encoder-decoder framework used for image captioning, showing how image features are encoded into vector representations and subsequently decoded into sequential textual descriptions.

Process model
Figure 3 displays the flowchart of the training module. The dataset and its ground-truth captions are loaded first. After the data is normalized for CNN encoding, the ResNet model is initialized and trained using the extracted features. The RNN, together with the system-specific start and end markers that tag each caption, is then used to decode the caption. The system completes the extraction when the final word is found; N is the total number of words in the candidate caption.

Figure 3: Flowchart of training model. This figure outlines the step-by-step process involved in training the model, including data preprocessing, feature extraction, model learning, and optimization.

The flowchart of the testing model is shown in Figure 4. The system first loads the encoder and decoder models, then loads the ResNet model and the input data for caption extraction. If no decoding errors occur, inference proceeds from the first word to the last. After the final word is reached, the decoded words are obtained, and a caption is created by using an attention mechanism to arrange the words sequentially in a meaningful way. The model uses a beam size of 5 with a maximum caption length of 20, and it is trained with a batch size of 128 for 20 epochs.

Figure 4: Flowchart of testing model. This figure depicts the testing workflow, demonstrating how input images are processed through the trained model to generate captions and evaluate performance.
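
The reported hyperparameters (beam size 5, maximum caption length 20, batch size 128, 20 epochs) can be gathered in one configuration object. The class name `CaptioningConfig` and the step estimate are illustrative assumptions; in particular, the step count assumes one training sample per image, whereas training on every image-caption pair would give a larger number.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptioningConfig:
    """Hyperparameters as reported in the text; field names are hypothetical."""
    beam_size: int = 5
    max_caption_length: int = 20
    batch_size: int = 128
    epochs: int = 20

    def steps_per_epoch(self, num_samples: int) -> int:
        """Number of batches needed to see every sample once."""
        return math.ceil(num_samples / self.batch_size)


config = CaptioningConfig()
steps = config.steps_per_epoch(82_783)  # MSCOCO training images
total_steps = steps * config.epochs
print(steps, total_steps)
```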

ResNet-152 image captioning algorithm
Initialize the input and output parameters. The input is the set of MSCOCO images \(I = (i_1, i_2, i_3, \ldots, i_N)\) along with the annotations \(J = (j_1, j_2, j_3, \ldots, j_N)\), and the output is the set of generated captions. In the first step, the images are pre-processed by resizing them while preserving the aspect ratio:

\(w_{new} = w \cdot \frac{T_s}{\max(w, h)}, \quad h_{new} = h \cdot \frac{T_s}{\max(w, h)}\)   (44)

where \(w\) and \(h\) are the original width and height of the image, \(w_{new}\) and \(h_{new}\) are the resized dimensions, \(T_s\) is a predefined target size (\(T_s = 224\)), and \(\max(w, h)\) is the largest dimension, which is scaled to \(T_s\) to maintain the aspect ratio.
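
As a worked example of the aspect-ratio rule, the sketch below computes the resized dimensions for a target size of 224 and then applies the mean and standard deviation normalization described earlier in the protocol. Rounding to whole pixels and the example values (a 640 × 480 image, mean 0.5, standard deviation 0.5) are assumptions made for illustration.

```python
from typing import List, Tuple


def resized_dimensions(w: int, h: int, target: int = 224) -> Tuple[int, int]:
    """Scale the longest side to `target` while keeping the aspect ratio."""
    longest = max(w, h)
    w_new = max(1, round(w * target / longest))
    h_new = max(1, round(h * target / longest))
    return w_new, h_new


def normalize(pixels: List[float], mean: float, std: float) -> List[float]:
    """Apply (pixel - mean) / std to every pixel value."""
    if std == 0:
        raise ValueError("std must be non-zero")
    return [(p - mean) / std for p in pixels]


landscape = resized_dimensions(640, 480)
portrait = resized_dimensions(480, 640)
scaled = normalize([0.0, 0.5, 1.0], mean=0.5, std=0.5)
print(landscape, portrait, scaled)
```

With a mean and standard deviation of 0.5, pixel values in [0, 1] are mapped to [-1, 1], which matches one of the ranges mentioned for normalization.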

After feature extraction, the identity block is declared as

\(f(i) + i\)   (45)

Then initialize the parameters: batch size, number of epochs, \(W_{hidden}\) as the weight for the hidden layers, \(W_{output}\) as the weight for the output layer, and \(B_{height}\) and \(B_{bias}\) as the biases. Once initialization is done, the output of the convolutional layer is calculated:

\(O_l = \mathrm{ReLU}(b_l \cdot s_l(O_{l-1}) + \mathrm{id}(O_{l-1}))\)   (46)

This can be regarded as a normal ReLU block if \(b_l\) is equal to 1. If \(b_l\) is not equal to 1, that is, if it is equal to 0, then it becomes:

\(O_l = \mathrm{ReLU}(\mathrm{id}(O_{l-1}))\)   (47)

Then compute the survival feasibility by

\(F_l = 1 - \frac{l}{K}(1 - F_K)\)   (48)

where \(F_K\) is the survival feasibility of the system and \(K\) is the total number of blocks in the model. Then calculate the probability distribution

\(P(y_t = j \mid y_{t-1}, I)\)   (49)

Once the probability distribution has been calculated, the model is built and used to decode the data:

[Equation not recoverable from the source: an RNN recurrence relation for sequence prediction.]   (50)

\(k_1, k_2, k_3, k_4, \ldots, k_{t-1}\) are the hidden decoding states.

When the model is applied, attention mechanisms are used for caption generation. The candidate caption is then evaluated against the reference captions, and the final metrics are computed using BLEU, METEOR, CIDEr, and ROUGE.
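
To show how a candidate caption is compared with reference captions, the following sketch implements a simplified sentence-level BLEU: clipped n-gram precision combined with a brevity penalty. It is an illustration only and is not the evaluation code behind the reported scores, which are normally computed at corpus level with a standard toolkit; smoothing is omitted, so any zero precision yields a score of 0.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate: List[str], references: List[List[str]], n: int) -> float:
    """Clipped n-gram precision: each n-gram counts at most as often as in any reference."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    max_ref: Counter = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())


def bleu(candidate: List[str], references: List[List[str]], max_n: int = 4) -> float:
    """Geometric mean of the 1..max_n precisions times the brevity penalty."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    c = len(candidate)
    # Reference length closest to the candidate length (shorter one on ties).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)


reference = "a dog runs on the grass".split()
perfect = bleu(reference, [reference])
bleu1 = bleu("a dog runs".split(), ["a dog sits".split()], max_n=1)
clipped = modified_precision("the the the".split(), ["the cat".split()], 1)
print(perfect, round(bleu1, 3), round(clipped, 3))
```

The clipping step is what prevents a caption that repeats one correct word from receiving a high score.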

Results


Software and environment specifications
Python 3.10 was the main programming language used for the experiments, and Visual Studio Code (VS Code) was used as the development environment. Important libraries include Pickle for data serialization, multiprocessing for parallel processing, glob for file handling, and PyTorch for deep learning model development. The hardware configuration included 256 GB of storage, 8 GB of RAM, and an NVIDIA GTX series GPU with CUDA support for faster computation. The experiments ran on a computer with either an AMD Ryzen 5000 series processor or an Intel Core i5 processor, under Windows 10/11. These specifications are summarized in Table 1.

Material       Specification
GPU            NVIDIA GTX series
Libraries      PyTorch, Pickle, Multiprocessing, Glob
OS             Windows 10/11
Processor      Intel Core i5 / AMD Ryzen 5000 series
Programming    Python 3.10
RAM            8 GB
Software       Visual Studio Code
Storage        256 GB

Table 1: Environment specifications. This table summarizes the materials used in the implementation and their specifications, such as programming languages, libraries, and hardware specifications.

Qualitative analysis
The model was analyzed across different categories, such as outdoor and indoor scenes and simple and complex scenes, and it describes the images reasonably well. B1, B2, B3, and B4 denote the BLEU scores, C denotes CIDEr, M denotes METEOR, and R denotes ROUGE. Each metric is reported as a single score: B1 is 0.579, B2 is 0.404, B3 is 0.279, B4 is 0.191, METEOR is 0.195, ROUGE is 0.396, and CIDEr is 0.6, as listed in Table 2.

Metric    MSCOCO Score
BLEU1     0.579
BLEU2     0.404
BLEU3     0.279
BLEU4     0.191
METEOR    0.195
ROUGE     0.396
CIDEr     0.6

Table 2: Experimental results. This table summarizes the performance of the proposed model using evaluation metrics such as BLEU, METEOR, ROUGE, and CIDEr, providing a quantitative assessment of caption quality.

Figure 5: Experimental result. This figure presents a graphical representation of the evaluation metrics, illustrating the comparative performance of the model across different measures.

Result comparison is presented in Tables 3, 4, and 5, which draw on references [10, 11, 12, 13, 14].

Method                           B1       B2       B3       B4
Face-CapF [10]                   0.5713   0.3651   0.2407   0.1652
Face-Init [10]                   0.5663   0.3649   0.243    0.1686
Face-CapL [11]                   0.589    0.3789   0.2507   0.1719
Face-Step [10]                   0.5843   0.3756   0.2478   0.1696
CSPDN-BiLSTM-SelfAtt [12]        0.6012   0.3992   0.2703   0.1921
CNN+RNN+ResNet-152 (Proposed)    0.579    0.404    0.279    0.191

Table 3: Result Comparison for BLEU Scores. This table compares BLEU score results across different models or configurations to highlight improvements in caption generation accuracy.

As shown in Tables 3 and 4, CSPDN-BiLSTM-SelfAtt [12] performs better on B1 and B4, whereas CNN+RNN+ResNet-152 performs better on B2 and B3. CNN+RNN+ResNet-152 is also better on METEOR and CIDEr, though not on ROUGE. The two methods are therefore evenly matched on BLEU scores, while the proposed method leads on two of the remaining three metrics, which gives it the overall advantage. Face-CapF [10], Face-Init [10], Face-CapL [11], and Face-Step [10] perform image captioning on the FlickrFace11K dataset, yet their results are comparatively poor even with a large dataset. The proposed model has a significantly higher CIDEr score; this discrepancy is attributed to differences in the evaluation procedure, dataset preparation, and implementation specifics.

Method                           METEOR   CIDEr    ROUGE
Face-CapF [10]                   0.1719   0.2304   0.4476
Face-Init [10]                   0.1717   0.2313   0.4484
Face-CapL [11]                   0.1744   0.2472   0.4547
Face-Step [10]                   0.1745   0.2283   0.4504
CSPDN-BiLSTM-SelfAtt [12]        0.1932   0.2617   0.4793
CNN+RNN+ResNet-152 (Proposed)    0.195    0.6      0.396

Table 4: Result Comparison with respect to METEOR, CIDEr, and ROUGE. This table provides a comparative analysis of multiple evaluation metrics to assess the semantic and syntactic quality of generated captions.

Method                           B1       B2       B3       B4       METEOR   ROUGE
Template-Augmentation [13]       0.238    0.109    0.05     0.022    0.096    0.249
EfficientNetB0 [14]              0.2827   0.1325   0.0588   0.0266   0.2661   0.3609
EfficientNetB1 [14]              0.289    0.1404   0.0642   0.0286   0.271    0.3718
ResNet50 [14]                    0.2637   0.1217   0.0496   0.0207   0.2437   0.3423
MobileNetV2 [14]                 0.2106   0.064    0.0215   0.009    0.1794   0.2606
CNN+RNN+ResNet-152 (Proposed)    0.579    0.404    0.279    0.191    0.195    0.396

Table 5: Result Comparison for BLEU, METEOR, and ROUGE Scores. This table presents a consolidated comparison of key evaluation metrics to demonstrate the overall effectiveness of the model.

As per Table 5, EfficientNetB1 [14] achieves the best METEOR score, but CNN+RNN+ResNet-152 is better on B1 to B4 and on ROUGE. Overall, the proposed method is superior on all BLEU scores and on ROUGE compared with the listed methods.

DATA AVAILABILITY:
All the raw data and coding files associated with this study are available in the supplemental files.

Discussion

In the field of artificial intelligence, captioning images is a difficult task. Image captioning has been the subject of numerous studies, yet generating accurate captions remains challenging. Many machine learning techniques can be applied to image captioning, and numerous studies have used CNN, RNN, and ResNet-152. However, higher precision and shorter processing time are still needed. The proposed system is built with a CNN as the encoder, an RNN as the decoder, torchvision as the library, and ResNet as the primary training model. ResNet's skip connections allow its deep layers to be trained effectively, and the proposed system achieves better performance than other conventional models such as Face-CapF, Face-Init, Face-Step, Face-CapL, CSPDN-BiLSTM-SelfAtt, Template-Augmentation, EfficientNetB0, EfficientNetB1, and MobileNetV2, among others [10-14].
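The role of the skip connection can be made explicit with the standard residual formulation. The symbols x, y, F, W, and L below are introduced here for illustration and do not appear in the article; the derivation is the textbook one for a single residual block, not a description of the authors' implementation.

```latex
% Residual block: the input x is added back to the learned transformation F.
y = x + F(x; W)

% Backpropagating a loss L through the block (chain rule):
\frac{\partial L}{\partial x}
  = \frac{\partial L}{\partial y}
    \left( I + \frac{\partial F(x; W)}{\partial x} \right)
```

Because of the identity term I, part of the gradient reaches earlier layers unchanged even when the Jacobian of F is small, which is how residual connections mitigate the vanishing gradient problem in very deep networks such as ResNet-152.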

The critical steps in the proposed work are applying a smart filter to clean the images, followed by feature extraction with all of its preliminary steps. Without precise feature extraction, the model cannot achieve its goal: if the system fails to extract features properly, the metric scores suffer. The training phase, which relies on in-depth analysis of the feature vectors and an attention mechanism, played a vital role in decoding the test data. One more critical step is updating the vocabulary. When new words arise while testing the data, those words are appended to the dictionary to improve the performance of the model. Together, these steps yielded better accuracy than previously suggested models such as the Template-Augmentation method. The system trained a model on the MSCOCO benchmark and obtained a more effective model for captioning images.
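The step of appending new words to the dictionary can be sketched as follows. This is a minimal illustration under stated assumptions: the class name `Vocabulary`, its methods, the special tokens, and the whitespace tokenization are all hypothetical choices made here, since the article does not describe its dictionary implementation.

```python
class Vocabulary:
    """Word <-> index mapping that grows when unseen words appear.

    Hypothetical helper for illustration; not the authors' code.
    """

    def __init__(self, specials=("<pad>", "<start>", "<end>", "<unk>")):
        self.word2idx = {}
        self.idx2word = []
        for token in specials:
            self.add_word(token)

    def add_word(self, word):
        """Append the word if it is new; return its index either way."""
        if word not in self.word2idx:
            self.word2idx[word] = len(self.idx2word)
            self.idx2word.append(word)
        return self.word2idx[word]

    def update_from_caption(self, caption):
        """Add every unseen word of a caption; return the newly added words."""
        new_words = []
        for word in caption.lower().split():
            if word not in self.word2idx:
                self.add_word(word)
                new_words.append(word)
        return new_words

    def __len__(self):
        return len(self.idx2word)


if __name__ == "__main__":
    vocab = Vocabulary()
    print(vocab.update_from_caption("A dog runs"))
    print(vocab.update_from_caption("a dog jumps"))
    print(len(vocab))
```

Note that in a trained decoder the embedding and output layers are sized to the vocabulary, so newly appended indices would also require those layers to be extended before the new words can be generated; that step is outside this sketch.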

As the size of the test data increases, new words related to the images may appear. This can also lead to irrelevant content in the generated captions, which is handled through the attention mechanism used in the model. The vocabulary can be updated through the attention mechanism, which can be effective for later evaluation. This can be regarded as a form of self-learning or exception handling. Because the model is trained on MSCOCO, which contains thousands of real-world images, many new objects may arise, and the vocabulary may need to be updated at each inference.
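To illustrate how an attention mechanism can down-weight irrelevant image regions during decoding, the sketch below implements plain dot-product soft attention in pure Python. The article does not specify which attention variant was used, so the scoring function, the function names `softmax` and `attend`, and the toy vectors are assumptions made here for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, region_features):
    """Dot-product soft attention (an assumed variant, for illustration).

    query: decoder state, a list of floats of length d
    region_features: list of image-region vectors, each of length d
    Returns (weights, context): the weights sum to 1, and the context is
    the weighted sum of the region vectors.
    """
    scores = [
        sum(q * f for q, f in zip(query, region))
        for region in region_features
    ]
    weights = softmax(scores)
    context = [
        sum(w * region[i] for w, region in zip(weights, region_features))
        for i in range(len(query))
    ]
    return weights, context


if __name__ == "__main__":
    # Toy example: the first region aligns with the query, the second does not.
    weights, context = attend([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]])
    print(weights, context)
```

Regions whose features align with the current decoder state receive larger weights, so the context vector passed to the decoder is dominated by the relevant parts of the image.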

One drawback of this work is that, because the model is trained on a contemporary dataset, it may perform poorly on much older images, especially black-and-white or low-quality historical images, due to differences in visual features, contrast, and texture. If images have poor resolution, it is harder to extract precise features, and the quality of the ResNet-152 encoding phase may degrade, producing poor or damaged feature vectors. Further limitations include single-dataset evaluation and a lack of cross-validation.

Compared to conventional approaches, the proposed model is better because it enhances feature extraction, thereby improving image caption generation. Smart filtering improves the feature extraction (encoding) phase, which in turn produces a better model. ResNet-152 also uses skip connections, which help reduce training time, so execution is much faster than with other models such as EfficientNetB0 [14]. The attention mechanism is another primary factor that improves the model's performance.

Image captioning has several important potential applications. The technique can be used in image retrieval systems, automated surveillance, and assistive technologies for people with visual impairments. As artificial intelligence advances rapidly, image retrieval systems need to improve, and this technique can contribute to that. With this model, visually impaired people can be assisted in perceiving their surroundings by having the generated captions converted into speech.

Disclosures

The authors declare that they have no competing financial interests or personal relationships that could have influenced the work reported in this paper.

Acknowledgements

We acknowledge the creators of the MSCOCO dataset for providing the benchmark used in this study. The authors declare that no external funding was received for this study.

Materials

List of materials used in this article
| Name | Company | Catalog Number | Comments |
| --- | --- | --- | --- |
| AMD Ryzen 5000 series | AMD | 100-100000059WOF | AMD Ryzen 5000 Series is a line of high-performance processors developed by AMD, based on the Zen 3 architecture. These processors are widely used in desktops and laptops for both general-purpose computing and demanding tasks such as data processing and machine learning workflows. |
| GPU | NVIDIA | 4.71933E+12 | The NVIDIA GeForce GTX is a series of graphics processing units (GPUs) developed by NVIDIA, widely used for gaming as well as general-purpose computing tasks like deep learning and image processing. |
| Intel Core i5 | Intel | BX8071514400F | Intel Core i5 is a mid-range processor series developed by Intel, widely used in personal computers for both general-purpose and computational tasks. |
| Python 3.10 | Python Software Foundation | PEP 619 | Python is a high-level, interpreted programming language widely used in scientific computing, data analysis, and machine learning. It is known for its simplicity, readability, and extensive ecosystem of libraries. |
| PyTorch | Facebook | 26.03-py3 | PyTorch is an open-source deep learning framework developed by Meta Platforms (formerly Facebook), widely used for building and training neural networks in research and industry. |
| Visual Studio Code | Microsoft | None | Visual Studio Code (VS Code) is a lightweight, open-source code editor developed by Microsoft. It is widely used for software development, including machine learning and deep learning projects. |
| Windows 11 | Microsoft | KB5083631 | Windows 11 is an operating system developed by Microsoft, widely used for general computing as well as software development and machine learning tasks. |

Reprints and Permissions

Request permission to reuse the text or figures of this JoVE article

Tags

Image Caption Generation, Deep Learning, ResNet Model, Smart Filtering, Feature Encoding, CNN Encoder, RNN Decoder, Object Classification, Semantic Analysis, MSCOCO Dataset
