This protocol uses CNNs, RNNs, and ResNets for image captioning, generating descriptions of the activities, people, objects, and other elements in images. The approach is evaluated with BLEU, CIDEr, METEOR, and ROUGE scores.
Research Article
June 12th, 2026
Image caption generation aims to produce a meaningful textual description of an image, with the extracted information reflecting the activities present in the image. ResNet (Residual Network) is well known for image classification because it learns deep hierarchical representations. This paper uses ResNet with various smart filters to analyze images in greater depth, enabling the generation of genuine and meaningful descriptions that closely match the reference captions. The work uses a smart filtering technique to enhance images, a CNN to encode features, model training, and then an RNN (Recurrent Neural Network) to decode the features. ResNet is a very effective model for computer vision tasks, especially object classification and semantic analysis. It is best known for its residual connections, also called skip connections, which mitigate the vanishing gradient problem, a crucial problem in deep learning. The model is trained on the MSCOCO (Microsoft Common Objects in Context) benchmark, a large dataset with reference annotations that is useful for various computer vision tasks. ResNet also helps enhance generalization capability, which is particularly useful for diverse images. The results obtained are BLEU scores of B1: 0.579, B2: 0.404, B3: 0.279, and B4: 0.191; METEOR: 0.195; ROUGE: 0.396; and CIDEr: 0.6.
In the fields of computer vision and natural language processing, image captioning is a crucial task that produces a description of an image and the actions it depicts. The model's goal is to comprehend images and translate the information into meaningful sentences or captions [1]. The procedure consists of two main phases: feature extraction, for which a CNN model is used, and image description, for which an RNN is used. In between, ResNet is used for semantic analysis, sequence generation, and an attention mechanism. ResNet differs from template-based methods and DenseNet-based modules in that it uses skip connections, which reduce execution time while improving performance. Image captioning has numerous applications, including helping visually impaired people, boosting social media platforms, optimizing image-based search engines, and supporting image-based artificial intelligence (AI) [2].
Another fundamental problem in computer vision is scene recognition, the process of identifying and classifying an image's general context or environment, such as a beach, cityscape, forest, or office. Unlike object recognition, which focuses on individual items, scene recognition considers textures, spatial arrangements, and object relationships to understand the larger context. It uses CNNs and Vision Transformers, deep learning models trained on large datasets such as Places365 and ImageNet. Applications include security surveillance, augmented and virtual reality (AR and VR) for immersive experiences, robotics for environmental awareness, and autonomous vehicles for navigation. Despite advances, problems such as shifting viewpoints, occlusions, and changing lighting keep scene recognition an active topic in computer vision and artificial intelligence research.
EnsCaption, a dual generative adversarial network model, was proposed to improve a generation-retrieval ensemble technique [3]. This design combines generation-based image captioning, which generates captions aligned with the intended goals, with a retrieval-based technique that uses a ranking model to select the best-matching caption for the image-based query. A mapping of images to a “meaning space” was introduced using visual components such as objects, activities, and scenes, which were then aligned with corresponding verbal templates [4]. Using the correlations and attributes found in the images, the approach constructs phrases, and the resulting sentences express information in a rich, condensed, and subtle way. Template-based caption generation was enhanced by incorporating commonsense knowledge to improve semantic understanding [5]. This technique extended the template's reach beyond direct image characteristics to encompass inferred associations. That work uses an existing object detection dataset to extract 16,000 commonsense statements for each annotated category. Additionally, generalization was achieved using WordNet, enabling the induction of a large number of facts about previously unseen objects [6]. A separate review presents an organized taxonomy of deep learning techniques for captioning images, covering topics such as attention mechanisms, reinforcement learning tactics, and encoder-decoder frameworks. Along with addressing issues such as object hallucination and contextual comprehension, it also examines commonly used datasets and assessment criteria. Its authors point out areas for further study, such as improving vision-language pre-training techniques and reducing dataset bias. A semantic analysis approach based on convolutional neural networks and recurrent neural networks was explored for image captioning tasks [7].
Image captioning is one of the best-known applications, allowing computers to produce descriptive phrases that summarize an image. To provide high-level, meaningful semantic descriptions, this procedure entails more than merely identifying objects and scenes; it also involves examining their states, characteristics, and interactions. Despite the inherent complexity and difficulty of image captioning, researchers have made impressive strides in the area. The three main deep neural network-based image captioning techniques covered in this study are CNN-RNN-based, CNN-CNN-based, and reinforcement learning frameworks. An end-to-end trainable model for image captioning was introduced, integrating computer vision and natural language processing to generate coherent descriptions of images [8]. It uses an encoder-decoder framework in which a pre-trained CNN encodes an image into a feature vector and an LSTM then decodes that vector into a sequence of words. Notwithstanding its drawbacks, including difficulties with intricate scenes, that paper's contribution to vision-language tasks remains fundamental [9].
In the proposed image captioning model, ResNet is the convolutional neural network (CNN) used to extract rich visual information from input images. ResNet serves as the encoder in an encoder-decoder architecture, producing a feature vector that represents the image. The decoder, which generates descriptive captions word by word, receives these features and is often implemented using a recurrent neural network (RNN), such as an LSTM or GRU. An attention mechanism can be added to improve performance by enabling the decoder to focus on specific regions of the image as it generates each word. To maximize caption accuracy, the model is trained end-to-end using a loss function such as cross-entropy and a dataset such as COCO. Transfer learning and ResNet fine-tuning can enhance feature extraction, further strengthening the model and enabling it to produce high-quality, contextually appropriate captions across a wide range of images. In image captioning, ResNet is often preferred over other models because it effectively addresses the vanishing gradient problem, a common issue in deep neural networks. This is made possible by residual learning, which uses skip connections to facilitate gradient flow during backpropagation and thereby allows considerably deeper networks to be trained without sacrificing performance. The multilayer perceptron, a fully connected feed-forward neural network, is associated with the trainable layer. The RNN then decodes captions using the softmax layer, producing candidate captions. As illustrated in Figure 1, f(x) is the function computed by the block's layers, x is the identity input, and f(x) + x is the block's forward output. The system uses residual blocks to calibrate the model during training, and its inputs pass through both weight connections and skip connections, also referred to as identity shortcuts.

Figure 1: Residual connection network. This figure illustrates the architecture of a residual network, highlighting skip connections that improve gradient flow and mitigate vanishing gradients during deep network training.
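To make the f(x) + x idea concrete, the following is a minimal sketch of a residual block in plain Python. It assumes a single linear layer followed by ReLU as a stand-in for the convolutional layers of a real ResNet block; the function names (`relu`, `linear`, `residual_block`) are illustrative and do not come from the original implementation.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def linear(weights, bias, v):
    """Dense layer: one output per weight row (stand-in for a conv layer)."""
    return [sum(w * a for w, a in zip(row, v)) + b
            for row, b in zip(weights, bias)]


def residual_block(x, weights, bias):
    """Return relu(f(x) + x), where f is a linear layer followed by ReLU."""
    fx = relu(linear(weights, bias, x))           # residual branch f(x)
    return relu([f + a for f, a in zip(fx, x)])   # skip connection adds x


x = [1.0, 2.0]
zero_w = [[0.0, 0.0], [0.0, 0.0]]
identity_w = [[1.0, 0.0], [0.0, 1.0]]
zero_b = [0.0, 0.0]

# With zero weights the residual branch vanishes and the input passes through.
print(residual_block(x, zero_w, zero_b))
# With identity weights the branch contributes a second copy of x.
print(residual_block(x, identity_w, zero_b))
```

The first call shows why the shortcut helps deep networks: even when a block's weights contribute nothing, the input (and its gradient) still flows through unchanged.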
Assume that Pl is the output of the l-th residual block, where l indexes the residual blocks. The block behaves as a standard ReLU residual block if bl is equal to 1; if bl is not equal to 1, the output is computed as:
(1)
Here, b is a random variable, and k is the mapping function.
(2)
Here, sl is the survival probability for the proposed system:
(3)
The resulting rule for the survival probability is:
(4)
Where SL is the survival probability and L is the total number of blocks.
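The description above (a random variable b that switches a residual block on or off, plus a per-block survival probability) resembles the stochastic depth technique. The sketch below illustrates that technique under the assumption that a linear decay rule is intended; the rule, the default final survival probability of 0.5, and all function names are assumptions for illustration and are not confirmed by the equations of this article.

```python
import random


def survival_probability(l, total_blocks, p_last=0.5):
    """Assumed linear decay: 1.0 at block 0, falling to p_last at block L."""
    return 1.0 - (l / total_blocks) * (1.0 - p_last)


def stochastic_block(x, f, p_survive, training, rng):
    """Residual block whose branch f is randomly dropped during training."""
    if training:
        # b = 1 keeps the residual branch; b = 0 reduces the block to identity.
        b = 1 if rng.random() < p_survive else 0
        return [max(0.0, b * fx + xi) for fx, xi in zip(f(x), x)]
    # At test time the branch is always used, scaled by its survival probability.
    return [max(0.0, p_survive * fx + xi) for fx, xi in zip(f(x), x)]


rng = random.Random(0)
branch = lambda v: [2.0 for _ in v]   # hypothetical residual branch

for l in (0, 5, 10):
    print(l, survival_probability(l, total_blocks=10))
print(stochastic_block([1.0], branch, 0.5, training=False, rng=rng))
```

Under this assumed rule, early blocks are almost always kept while deeper blocks are dropped more often, which shortens the expected depth during training.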
Image captioning is a challenging task that combines natural language processing and computer vision to produce descriptive textual captions for images. It requires comprehending and interpreting an image's visual content and translating it into coherent sentences within its context. Extensive and diverse datasets are crucial for training and evaluating models in this field, because they offer a vast array of images and related annotations for developing and testing image captioning algorithms. The most frequently used datasets are MSCOCO and Flickr30k, which contain tens of thousands to hundreds of thousands of images and pose various challenges in image processing. MSCOCO is much larger than Flickr30k [11]. The MSCOCO dataset is split into the following sets: 82,783 images for training, 40,504 for validation, and 40,775 for testing.
The implementation uses ResNet-152 as the main model, with a CNN as the encoder, an RNN as the decoder, and the resources listed in the Table of Materials.
ResNet-152
ResNet serves as the backbone for efficient feature extraction in image captioning. It trains better than many other models because its design addresses the vanishing gradient problem. Various objects may appear in an image, and the model needs to understand their relationships to caption it well, which is why the process can be viewed as hierarchical feature extraction. ResNet-152 can handle complex computer vision tasks, and its key advantage is the effective use of residual, or skip, connections, which let it learn complex, robust features and achieve higher accuracy. ResNet-152 follows a bottleneck design that reduces computational cost and makes it more effective than other architectures, such as VGG-16. It is also a prominent transfer learning backbone, suitable for pre-trained models and for varied tasks such as object detection and segmentation. The skip connections accelerate training and make it more stable. ResNet is quite different from a transformer-based model, which uses a self-attention mechanism to understand sequential data. A transformer-based model requires a large amount of data for a deep understanding of textual data; it yields effective results but runs somewhat slower. The motivation for selecting ResNet is its skip connections, which speed up execution while significantly improving results. In image captioning, ResNet is used to extract features that represent the objects and actions in the image. For an input Z, the residual block can be computed as:
(5)
Where Z is the input of the residual block. The residual function involves batch normalization, convolutional layers, and ReLU activation, and {xi} denotes the learnable weights of the corresponding layers. Z also defines the identity carried by the skip connection, which mitigates the vanishing gradient problem. ResNet is generally used as a feature extractor that maps images to visual features. Here, I is the input image, which is mapped through the feature maps to a high-level visual feature representation V.
(6)
Before feature extraction, the image must be preprocessed. The input is a raw image from the MSCOCO benchmark, so the first preprocessing step is to resize and normalize it.
(7)
(8)
Where Hl is the height of the image, Wl is the width of the image, and Iresize is the resized image.
The pixel values are then normalized to the range [-1, 1] or [0, 1]:
(9)
Where μ is the mean pixel value and σ is the standard deviation of the referenced image. The normalized image is then processed further for feature extraction.
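As a small illustration of the normalization step, the sketch below scales 8-bit pixel values to [0, 1] and then standardizes them with a mean μ and standard deviation σ. It assumes 8-bit input and uses μ = σ = 0.5 as example values, which maps the result to [-1, 1]; the actual statistics used in the study are not stated, and the function name `normalize` is illustrative.

```python
def normalize(pixels, mean, std):
    """Scale 8-bit values to [0, 1], then apply (value - mean) / std."""
    return [((p / 255.0) - mean) / std for p in pixels]


# Example values only: mean = std = 0.5 maps [0, 1] onto [-1, 1].
row = [0, 64, 128, 255]
print(normalize(row, mean=0.5, std=0.5))
```

In a PyTorch pipeline the same operation would usually be applied per colour channel on tensors rather than on Python lists.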
(10)
The result is the feature vector. When the raw caption is tokenized, it is converted into a numeric format.
(11)
If the caption is split into words, then
(12)
Here, the vocabulary plays an important role, with every word uniquely identified by an integer index.
(13)
Where Vc is the vocabulary function. All sequences must have the same length, so the maximum length is denoted Lmax.
(14)
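The tokenization, integer indexing, and padding steps described above can be sketched as follows. This is a minimal example that assumes whitespace tokenization and four special tokens (`<pad>`, `<start>`, `<end>`, `<unk>`); the token names and helper functions are hypothetical and may differ from the original implementation.

```python
def tokenize(caption):
    """Lower-case the caption and split it on whitespace."""
    return caption.lower().split()


def build_vocab(captions, specials=("<pad>", "<start>", "<end>", "<unk>")):
    """Assign every distinct word a unique integer index."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for cap in captions:
        for word in tokenize(cap):
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode(caption, vocab, l_max):
    """Map a caption to indices, then truncate or pad to exactly l_max."""
    words = [vocab.get(w, vocab["<unk>"]) for w in tokenize(caption)]
    ids = [vocab["<start>"]] + words + [vocab["<end>"]]
    ids = ids[:l_max]                                   # truncate long captions
    return ids + [vocab["<pad>"]] * (l_max - len(ids))  # pad short captions


vocab = build_vocab(["a dog runs"])
print(encode("a dog", vocab, l_max=6))
print(encode("a cat", vocab, l_max=5))   # "cat" is unseen and maps to <unk>
```

Note that simple truncation can cut off the end marker for captions longer than `l_max`; a production pipeline may handle that case differently.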
The tokens are then embedded as:
(15)
for j = 1, 2, 3, ..., Lmax
Where each token is mapped to an embedded vector with K dimensions. The decoder is then used to decode the caption and generate a candidate caption, based on a probabilistic model.
(16)
Where wj is the word at time step j, w1:j-1 are the words generated up to time step j-1, and ej-1 is the embedding of the previous word wj-1. At every time step, the network predicts the next word by computing a probability distribution over the vocabulary.
(17)
Where woutput is the output weight and boutput is the output bias. The word with the maximum probability is then selected as
(18)
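The two steps above, a softmax over the vocabulary followed by selection of the most probable word, can be illustrated with a short sketch. It assumes the decoder state is a plain list `h` and that the output layer is a dense layer with weights `w_output` and biases `b_output`; the toy dimensions and values are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_next(h, w_output, b_output):
    """Project the hidden state to vocabulary logits and pick the argmax."""
    logits = [sum(w * a for w, a in zip(row, h)) + b
              for row, b in zip(w_output, b_output)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Toy example: hidden size 2, vocabulary of 3 words.
h = [1.0, 0.0]
w_output = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0]]
b_output = [0.0, 0.0, 0.0]
best, probs = predict_next(h, w_output, b_output)
print(best, probs)
```

Picking the argmax at every step corresponds to greedy decoding; beam search keeps several candidates instead of one.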
The length of the candidate caption is determined once a special token, such as the start or end marker, is identified. Beam search is also useful for selecting a better candidate caption, so the sequence is:
(19)
(20)
So the generated candidate caption is the sequence of 
Long Short-Term Memory (LSTM) is generally used for sequence generation. The LSTM takes the features extracted by the CNN and generates words sequentially to form meaningful sentences. At each time step t, the LSTM computes the forget gate:
ft = σ(wf[ht-1, yt] + bf)
(21)
Where ft is the forget gate, σ is the activation function, wf is the weight, bf is the bias, yt is the input feature vector, and ht-1 is the previous hidden state.
(22)
(23)
Jt is the input gate, and equation (23) gives the candidate state; wj and wc are the weights for the input gate and candidate state, respectively, and bj and bc are the corresponding biases.
(24)
Ct is the cell state, and Ct-1 is the previous cell state.
(25)
Ot is the output gate, with wo as its weight and bo as its bias. To initialize the hidden and cell states, the following computations are required.
(26)
(27)
Where hi and Ci are the hidden and cell states, respectively; wh and wc are the weights for the hidden and cell states, respectively; bh and bc are the biases; and k is the feature extractor. The caption sequence is computed as:
(28)
Where T is the length of generated caption.
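The gate computations described in this section follow the standard LSTM cell, and a single time step can be sketched in plain Python as shown below. The parameter names mirror the symbols in the text (wf, wj, wc, wo and their biases), but the dictionary layout, the helper functions, and the toy values are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(w, b, v):
    """Dense layer: w is a list of rows, b a list of biases."""
    return [sum(wi * vi for wi, vi in zip(row, v)) + bi
            for row, bi in zip(w, b)]


def lstm_step(y_t, h_prev, c_prev, p):
    """One standard LSTM step on the concatenated vector [h_prev, y_t]."""
    z = h_prev + y_t                                             # concatenation
    f_t = [sigmoid(a) for a in affine(p["wf"], p["bf"], z)]      # forget gate
    j_t = [sigmoid(a) for a in affine(p["wj"], p["bj"], z)]      # input gate
    c_hat = [math.tanh(a) for a in affine(p["wc"], p["bc"], z)]  # candidate
    o_t = [sigmoid(a) for a in affine(p["wo"], p["bo"], z)]      # output gate
    c_t = [f * c + j * g for f, c, j, g in zip(f_t, c_prev, j_t, c_hat)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Toy parameters: hidden size 1, input size 1, all weights and biases zero.
params = {name: [[0.0, 0.0]] for name in ("wf", "wj", "wc", "wo")}
params.update({name: [0.0] for name in ("bf", "bj", "bc", "bo")})
h_t, c_t = lstm_step([3.0], [0.0], [2.0], params)
print(h_t, c_t)
```

With all-zero parameters every gate equals 0.5 and the candidate state is 0, so the new cell state is simply half of the previous one, which makes the example easy to check by hand.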
254 × 254 × 3 is the size of the resized, pre-processed image, and I is the input image.
(29)
Where W and b are the weight and bias, respectively, I is the input feature map, and ReLU is the activation function. This is the computation of the convolutional layer. The pooling layer can then be computed as:
(30)
After the pooling layer, the fully connected layer can be mapped as:
(31)
Where wf and bf are the weight and bias of the network, respectively.
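The convolution, ReLU, and pooling operations described above can be illustrated on a tiny single-channel feature map. The sketch assumes a "valid" convolution with stride 1 and non-overlapping 2 × 2 max pooling; these settings and the helper names are illustrative choices, not parameters taken from the article.

```python
def conv2d_valid(image, kernel, bias=0.0):
    """Single-channel 2D convolution (cross-correlation), stride 1, no padding."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw)) + bias
             for c in range(len(image[0]) - kw + 1)]
            for r in range(len(image) - kh + 1)]


def relu2d(fmap):
    """Element-wise ReLU on a 2D feature map."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling with a square window."""
    rows, cols = len(fmap), len(fmap[0])
    return [[max(fmap[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, cols - size + 1, size)]
            for r in range(0, rows - size + 1, size)]


fmap = [[1.0, -2.0, 3.0, 0.0],
        [4.0, 5.0, -6.0, 1.0],
        [0.0, 0.0, 2.0, 2.0],
        [-1.0, -3.0, 7.0, 8.0]]
pooled = max_pool2d(relu2d(fmap))
print(pooled)
print(conv2d_valid([[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0], [0.0, 1.0]]))
```

A fully connected layer then flattens the pooled map and applies a dense projection, as in the `affine` style of computation used for the output layer.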
(32)
(33)
Where N denotes the spatial regions and d is the dimension of the feature.
(34)
(35)
Where wh and bh are the weight and bias of the hidden state, respectively, and wc and bc are the weight and bias of the cell state, respectively. The caption can be generated as:
(36)
Encoder and decoder
The proposed system encodes the data with a CNN, following the encoder-decoder approach used in machine translation. In this case, the input and output are both sequences, but they may differ in length. The machine encodes and decodes one vector at a time: starting from an initial vector, it continues computing until the final conditional probability distribution is obtained. One example is as follows:
(37)
This is the probability distribution.
The system encodes the image as a vector, which can later be decoded. fcn(I) is the image model used for image understanding.
(38)
(39)
(40)
S1 is the iteration that follows S0, and S2 is the iteration that follows S1; in other words, every input depends on the output of the previous step. The CNN converts images into vectors and passes them to the following layer, which traverses all of the vectors. After the RNN decodes the vectors into words, an attention mechanism is used to arrange the words sequentially into a meaningful sentence.
(41)
Where T is the length of the input.
(42)
(43)
k1, k2, k3, k4, ......, kt-1 are hidden decoding states.

Figure 2: Encoding and decoding model. This figure presents the encoder-decoder framework used for image captioning, showing how image features are encoded into vector representations and subsequently decoded into sequential textual descriptions.
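The article does not specify the exact form of its attention mechanism, so the sketch below shows one common variant, dot-product soft attention, purely as an illustration. It assumes the encoder produces one feature vector per image region and that the decoder state has the same dimension; the function name and toy values are hypothetical.

```python
import math


def attention(decoder_state, region_features):
    """Dot-product soft attention: returns weights and a context vector."""
    # Score each image region against the current decoder state.
    scores = [sum(d * f for d, f in zip(decoder_state, region))
              for region in region_features]
    # Softmax turns the scores into weights that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The context vector is the weighted average of the region features.
    dim = len(region_features[0])
    context = [sum(w * region[k] for w, region in zip(weights, region_features))
               for k in range(dim)]
    return weights, context


regions = [[1.0, 0.0], [0.0, 1.0]]          # two toy image regions
print(attention([0.0, 0.0], regions))       # uninformative state: equal weights
print(attention([10.0, 0.0], regions))      # state aligned with the first region
```

At each decoding step the context vector would be combined with the previous word embedding before being fed to the RNN, so the decoder can focus on different regions for different words.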
Process model
Figure 3 shows the flowchart of the training module. The dataset and its ground-truth captions are loaded first. After the data is normalized for CNN encoding, the ResNet model is initialized and trained using the extracted features. The RNN then decodes the caption, using words tagged with start and end markers. The system completes the extraction when the final word is found, where N is the total number of words in the candidate caption.

Figure 3: Flowchart of training model. This figure outlines the step-by-step process involved in training the model, including data preprocessing, feature extraction, model learning, and optimization.
Figure 4 shows the flowchart of the testing model. The system first loads the encoder and decoder models, then loads the ResNet model and the input data for caption extraction. If no decoding errors occur, inference proceeds from the first word to the last. After the final word is reached, the decoded words are obtained, and a caption is created by employing an attention mechanism to arrange the words sequentially in a meaningful way. The model uses a beam size of five with a maximum length of 20, and a batch size of 128 with 20 epochs.

Figure 4: Flowchart of testing model. This figure depicts the testing workflow, demonstrating how input images are processed through the trained model to generate captions and evaluate performance.
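Beam search with the stated beam size of five and maximum length of 20 can be sketched as follows. The decoder is replaced here by a hypothetical `toy_step` function that returns log-probabilities for the next token given the sequence so far; token ids, the transition table, and the absence of length normalization are all simplifying assumptions.

```python
import math


def beam_search(step_fn, start_id, end_id, beam_size=5, max_len=20):
    """Keep the beam_size best partial captions until they emit end_id."""
    beams = [([start_id], 0.0)]          # (token sequence, summed log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, logp in step_fn(seq).items():
                candidates.append((seq + [token], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == end_id else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)               # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])[0]


# Hypothetical toy model: 0 = start, 1 = end, 2 = "a", 3 = "dog".
TABLE = {
    0: {2: math.log(0.6), 3: math.log(0.4)},
    2: {3: math.log(0.9), 1: math.log(0.1)},
    3: {1: math.log(0.8), 2: math.log(0.2)},
}


def toy_step(seq):
    """Next-token log-probabilities that depend only on the last token."""
    return TABLE[seq[-1]]


print(beam_search(toy_step, start_id=0, end_id=1))
```

In a real captioning model, `step_fn` would run the decoder on the image features and the partial caption; scores are often length-normalized so that short captions are not unduly favoured.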
ResNet-152 image captioning algorithm
Initialize the input and output parameters. The input is the set of MSCOCO images I = (i1, i2, i3, ..., iN) along with the annotations J = (j1, j2, j3, ..., jN), and the output is the set of generated captions. In the first step, pre-process the input images by resizing them while preserving the aspect ratio:
(44)
Where w and h are the original width and height of the image, wnew and hnew are the resized dimensions, Ts is a predefined target size (Ts = 224), and max(w, h) is the largest dimension, which is scaled to Ts to maintain the aspect ratio.
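The aspect-ratio-preserving resize described above can be expressed as a small helper. This sketch assumes that the scale factor is Ts / max(w, h) and that the new dimensions are rounded to the nearest integer; the rounding choice and the function name `resize_dims` are assumptions.

```python
def resize_dims(w, h, target=224):
    """Scale (w, h) so that the larger side equals target, keeping the ratio."""
    scale = target / max(w, h)
    # Guard against a zero-sized side for extremely thin images.
    return max(1, round(w * scale)), max(1, round(h * scale))


print(resize_dims(640, 480))   # landscape image
print(resize_dims(100, 400))   # portrait image
```

The resulting image is usually padded or cropped afterwards to the square input size expected by the network.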
After feature extraction, declare the identity block as
(45)
Then initialize the parameters, such as the batch size, the number of epochs, Whidden as the weight for the hidden layers, Woutput as the weight for the output layer, and Bheight and Bbias as the biases. Once initialization is complete, calculate the output of the convolutional layer.
(46)
The block may be regarded as a normal ReLU block if bl is equal to 1. If bl is equal to 0, it becomes:
(47)
Then compute the survival feasibility by
(48)
Where FK is the survival feasibility of the system, and K is the total number of blocks in the model. Then calculate the probability distribution
(49)
Once the probability distribution has been calculated, build the model and use it to decode the data with
(50)
Where k1, k2, k3, k4, ..., kt-1 are the hidden decoding states.
When running the model, apply the attention mechanism for caption generation and compare each candidate caption against the reference caption; the final scores can then be evaluated using BLEU, METEOR, CIDEr, and ROUGE.
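To show how a candidate caption is scored against a reference, the following is a simplified, single-reference BLEU sketch with clipped n-gram precision and a brevity penalty. It is meant only to illustrate the idea; the scores reported in this article were presumably produced with a standard evaluation toolkit, which handles multiple references, corpus-level aggregation, and tokenization differently.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU-max_n with uniform weights and brevity penalty."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    # Penalize candidates that are shorter than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


ref = "a dog runs on the beach".split()
print(bleu(ref, ref))                               # identical caption
print(bleu("a dog runs".split(), ref, max_n=1))     # short but correct words
print(modified_precision("the the the".split(), "the cat".split(), 1))
```

The last call shows the clipping step: repeating a correct word does not increase the score beyond the number of times it appears in the reference.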
Software and environment specifications
Python 3.10 was the main programming language used for the experiments, and Visual Studio Code (VS Code) was used as the development environment. Important libraries used in this research include Pickle for data serialization, multiprocessing for parallel processing, glob for file handling, and PyTorch for deep learning model development. The hardware configuration included 256 GB of storage, 8 GB of RAM, and an NVIDIA GTX series GPU with CUDA support for faster computation. The experiments were run on a computer with either an AMD Ryzen 5000 series processor or an Intel Core i5 processor, under Windows 10/11. Table 1 summarizes the environment specifications.
| Material | Specification |
|---|---|
| GPU | NVIDIA GTX series |
| Libraries | PyTorch, Pickle, Multiprocessing, Glob |
| OS | Windows 10/11 |
| Processor | Intel Core i5/AMD Ryzen 5000 series |
| Programming | Python 3.10 |
| RAM | 8 GB |
| Software | Visual Studio Code |
| Storage | 256 GB |
Table 1: Environment specifications. This table summarizes the materials used in the implementation and their specifications, such as programming languages, libraries, and hardware specifications.
Qualitative analysis
The qualitative analysis considered different categories, such as outdoor and indoor scenes and simple and complex scenes, and the model is reasonably effective at describing the images. B1, B2, B3, and B4 denote the BLEU scores, C denotes CIDEr, M denotes METEOR, and R denotes ROUGE. The scores obtained are B1 = 0.579, B2 = 0.404, B3 = 0.279, B4 = 0.191, METEOR = 0.195, ROUGE = 0.396, and CIDEr = 0.6, as listed in Table 2.
| Metric | MSCOCO Score |
|---|---|
| BLEU1 | 0.579 |
| BLEU2 | 0.404 |
| BLEU3 | 0.279 |
| BLEU4 | 0.191 |
| METEOR | 0.195 |
| ROUGE | 0.396 |
| CIDEr | 0.6 |
Table 2: Experimental results. This table summarizes the performance of the proposed model using evaluation metrics such as BLEU, METEOR, ROUGE, and CIDEr, providing a quantitative assessment of caption quality.

Figure 5: Experimental result. This figure presents a graphical representation of the evaluation metrics, illustrating the comparative performance of the model across different measures.
The result comparison is presented in Tables 3, 4, and 5. The methods compared in these tables are taken from references [10, 11, 12, 13, 14].
| Method | B1 | B2 | B3 | B4 |
|---|---|---|---|---|
| Face-CapF [10] | 0.5713 | 0.3651 | 0.2407 | 0.1652 |
| Face-Init [10] | 0.5663 | 0.3649 | 0.243 | 0.1686 |
| Face-CapL [11] | 0.589 | 0.3789 | 0.2507 | 0.1719 |
| Face-Step [10] | 0.5843 | 0.3756 | 0.2478 | 0.1696 |
| CSPDN-BiLSTM-SelfAtt [12] | 0.6012 | 0.3992 | 0.2703 | 0.1921 |
| CNN+RNN+ResNet-152 (Proposed) | 0.579 | 0.404 | 0.279 | 0.191 |
Table 3: Result Comparison for BLEU Scores. This table compares BLEU score results across different models or configurations to highlight improvements in caption generation accuracy.
As shown in Tables 3 and 4, CSPDN-BiLSTM-SelfAtt [12] performs better on B1 and B4, whereas CNN+RNN+ResNet-152 performs better on B2 and B3. CNN+RNN+ResNet-152 is also better on METEOR and CIDEr, but not on ROUGE. The two methods are therefore evenly matched on BLEU scores, while the proposed one is better on two of the other three metrics, so the proposed method is considered better overall. Face-CapF [10], Face-Init [10], Face-CapL [11], and Face-Step [10] perform image captioning on the FlickrFace11K dataset, but their results are comparatively poor even for a large dataset. The proposed model's significantly higher CIDEr score is attributed to differences in the evaluation procedure, dataset preparation, and implementation specifics.
| Method | METEOR | CIDEr | ROUGE |
|---|---|---|---|
| Face-CapF [10] | 0.1719 | 0.2304 | 0.4476 |
| Face-Init [10] | 0.1717 | 0.2313 | 0.4484 |
| Face-CapL [11] | 0.1744 | 0.2472 | 0.4547 |
| Face-Step [10] | 0.1745 | 0.2283 | 0.4504 |
| CSPDN-BiLSTM-SelfAtt [12] | 0.1932 | 0.2617 | 0.4793 |
| CNN+RNN+ResNet-152 (Proposed) | 0.195 | 0.6 | 0.396 |
Table 4: Result Comparison with respect to METEOR, CIDEr, and ROUGE. This table provides a comparative analysis of multiple evaluation metrics to assess the semantic and syntactic quality of generated captions.
| Method | B1 | B2 | B3 | B4 | METEOR | ROUGE |
|---|---|---|---|---|---|---|
| Template-Augmentation [13] | 0.238 | 0.109 | 0.05 | 0.022 | 0.096 | 0.249 |
| EfficientNetB0 [14] | 0.2827 | 0.1325 | 0.0588 | 0.0266 | 0.2661 | 0.3609 |
| EfficientNetB1 [14] | 0.289 | 0.1404 | 0.0642 | 0.0286 | 0.271 | 0.3718 |
| ResNet50 [14] | 0.2637 | 0.1217 | 0.0496 | 0.0207 | 0.2437 | 0.3423 |
| MobileNetV2 [14] | 0.2106 | 0.064 | 0.0215 | 0.009 | 0.1794 | 0.2606 |
| CNN+RNN+ResNet-152 (Proposed) | 0.579 | 0.404 | 0.279 | 0.191 | 0.195 | 0.396 |
Table 5: Result Comparison for BLEU, METEOR, and ROUGE Scores. This table presents a consolidated comparison of key evaluation metrics to demonstrate the overall effectiveness of the model.
As per Table 5, EfficientNetB1 [14] is better on METEOR, but CNN+RNN+ResNet-152 is better on B1-B4 and ROUGE. Overall, the proposed model is superior on all BLEU scores and on ROUGE compared with the listed methods.
DATA AVAILABILITY:
All the raw data and coding files associated with this study are available in the supplemental files.
In the field of artificial intelligence, captioning images is a difficult task. Image captioning has been the subject of numerous studies, yet accurate captioning still demands a high level of precision. Many machine learning techniques can be used for image captioning, and numerous studies have used CNN, RNN, and ResNet-152. However, increased precision and reduced processing time are still necessary. The proposed system is built using a CNN as the encoder, an RNN as the decoder, TorchVision as the library, and ResNet as the primary training model. ResNet uses skip connections across its layers to achieve better performance than other conventional models such as Face-CapF, Face-Init, Face-Step, Face-CapL, CSPDN-BiLSTM-SelfAtt, Template-Augmentation, EfficientNetB0, EfficientNetB1, and MobileNetV2 [10, 11, 12, 13, 14].
The critical steps in the proposed work are using a smart filter to clean the images, followed by feature extraction with all of its primary steps. Without precise feature extraction, the model cannot achieve its goal: if the system fails to extract the features properly, the metric scores suffer. The training phase, carried out with deep analysis of the feature vectors and the attention mechanism, played a vital role in decoding the testing data. One more critical step is updating the vocabulary. When new words arise while testing the data, those words are appended to the dictionary to improve the performance of the model. These critical steps played a vital role in attaining better accuracy than previously suggested models, such as the Template Augmentation method. The system trained a model on the MSCOCO benchmark and obtained a more effective model for captioning images.
If the test data size increases, new words related to the images may appear. These may cause irrelevant output during caption generation, which can be handled through the attention mechanism used in the model. The vocabulary can be updated through the attention mechanism, which can be effective for later evaluation; this can be regarded as self-learning or exception handling. Because the model is trained with MSCOCO, which contains thousands of real-world images, many objects may arise that need to be updated at each inference.
One drawback of this work is that, because the model is trained on contemporary datasets, it may perform poorly on much older images, especially black-and-white or low-quality historical images, due to differences in visual features, contrast, and texture. If images have poor resolution, it is harder to extract precise features, and the ResNet-152 encoding phase may degrade in this case. Very old images in particular yield poor or damaged feature vectors. Other limitations include single-dataset evaluation and a lack of cross-validation.
Compared to conventional approaches, the proposed model is better because it enhances feature extraction, thereby improving image caption generation. Smart filtering improves the feature extraction, or encoding, phase, which yields a better model. ResNet-152 also uses skip connections that save time during training, so execution is much faster than with other models such as EfficientNetB0 [14]. The attention mechanism is another primary factor that improves the model's performance.
Image captioning has several important potential applications. The technique can be used in image retrieval systems, automated surveillance, and assistive technologies for people with visual impairments. As artificial intelligence advances rapidly, image retrieval systems need to improve, and this technique can contribute to that. With this model, visually impaired people can be assisted in perceiving the world by having captions translated into speech.
The authors declare that they have no competing financial interests or personal relationships that could have influenced the work reported in this paper.
We acknowledge the creators of the MSCOCO datasets for providing the benchmarks used in this study. The authors declare that no external funding was received for this study.
| Name | Company | Catalog Number | Comments |
|---|---|---|---|
| AMD Ryzen 5000 series | AMD | 100-100000059WOF | AMD Ryzen 5000 Series is a line of high-performance processors developed by AMD, based on the Zen 3 architecture. These processors are widely used in desktops and laptops for both general-purpose computing and demanding tasks such as data processing and machine learning workflows. |
| GPU | NVIDIA | 4.71933E+12 | The NVIDIA GeForce GTX is a series of graphics processing units (GPUs) developed by NVIDIA, widely used for gaming as well as general-purpose computing tasks like deep learning and image processing. |
| Intel Core i5 | Intel | BX8071514400F | Intel Core i5 is a mid-range processor series developed by Intel, widely used in personal computers for both general-purpose and computational tasks. |
| Python 3.10 | Python Software Foundation | PEP 619 | Python is a high-level, interpreted programming language widely used in scientific computing, data analysis, and machine learning. It is known for its simplicity, readability, and extensive ecosystem of libraries. |
| PyTorch | | 26.03-py3 | PyTorch is an open-source deep learning framework developed by Meta Platforms (formerly Facebook), widely used for building and training neural networks in research and industry. |
| Visual Studio Code | Microsoft | None | Visual Studio Code (VS Code) is a lightweight, open-source code editor developed by Microsoft. It is widely used for software development, including machine learning and deep learning projects. |
| Windows 11 | Microsoft | KB5083631 | Windows 11 is an operating system developed by Microsoft, widely used for general computing as well as software development and machine learning tasks. |