INSPIR: Image-Text Synthesis Pipeline for Intelligent Retrieval and Generation
DOI: https://doi.org/10.70917/ijcisim-2025-0044
Abstract
The INSPIR (Image-Text Synthesis Pipeline for Intelligent Retrieval and Generation) framework takes an ensemble approach to image captioning and image retrieval. Descriptive captions are generated from each image by an ensemble of BLIP (Bootstrapping Language-Image Pre-training), ViT-GPT2 (Vision Transformer combined with GPT-2), and GIT (Generative Image Text), and CLIP (Contrastive Language-Image Pre-training) then ranks the generated captions by their relevance to the image. To evaluate the system, the effect of temperature scaling and ensemble weights on the caption ranking was analyzed, revealing how these settings shift the balance between relevance and diversity. The model was tested on randomly selected photos from the Flickr8k dataset, with effectiveness measured by cosine similarity, BLEU, and METEOR scores. Llama3.1 uses the top-ranked captions to produce creative outputs tailored to various applications, including social media captions and image notes. By integrating multiple modalities within a unified semantic space through contrastive learning, this work aims to advance image captioning beyond conventional classification tasks and to offer performance that generalizes across the complexities of language and vision. A major application of INSPIR is image retrieval: the system annotates uploaded images and lets users run text-based searches, giving efficient access to relevant visual content.
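The ranking step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the scoring formula, so everything here is an assumption: each caption's score is taken to be the cosine similarity between image and caption embeddings, multiplied by a per-model ensemble weight, divided by a temperature, and passed through a softmax. The function names, the toy two-dimensional embeddings, and the weight values are hypothetical; in a real pipeline the embeddings would come from CLIP's image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_captions(image_vec, candidates, weights, temperature=1.0):
    """Rank candidate captions against an image embedding.

    candidates: list of (model_name, caption, caption_vec) tuples.
    weights: per-model ensemble weights (assumed form; missing models get 1.0).
    Returns (caption, model_name, probability) tuples, best first.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Assumed scoring rule: weighted cosine similarity scaled by temperature.
    logits = [
        weights.get(model, 1.0) * cosine(image_vec, vec) / temperature
        for model, _, vec in candidates
    ]
    # Numerically stable softmax over the logits.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    scored = [
        (caption, model, e / total)
        for (model, caption, _), e in zip(candidates, exps)
    ]
    return sorted(scored, key=lambda item: item[2], reverse=True)


# Toy stand-ins for CLIP embeddings (hypothetical values).
image_vec = [1.0, 0.0]
candidates = [
    ("blip", "a dog runs on grass", [0.9, 0.1]),
    ("vit_gpt2", "a dog", [0.6, 0.8]),
    ("git", "an animal outdoors", [0.0, 1.0]),
]
weights = {"blip": 1.0, "vit_gpt2": 1.0, "git": 1.0}

for caption, model, prob in rank_captions(image_vec, candidates, weights, 0.5):
    print(f"{prob:.3f}  {model:9s} {caption}")
```

Under this assumed rule, a lower temperature concentrates probability on the best-matching caption, while a higher temperature flattens the distribution and keeps more diverse captions in play. That is one way to read the relevance and diversity balance the abstract mentions.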
License
Copyright (c) 2025 Viomesh Singh, Prithviraj Jadhav, Hritesh Maikap, Sarvesh Jadhav, Chinmay Ingale, Sahil Jadhav

This work is licensed under a Creative Commons Attribution 4.0 International License.