Analyzing Visual Content across Frames to Identify Objects and Generate Textual Descriptions for Images
Abstract
A textual description generated for an image is called a caption. Caption generators have been growing in popularity since 2012. Every day, viewers come across many images from different sources, such as the Internet, television, news articles, and social networking sites, that they want to interpret. Although generating a caption for an image is a challenging task, recent advances in computer vision have made it possible for a machine to act as the storyteller of an image. Most of the time, images do not come with descriptions, yet humans can easily understand them. For a machine to caption images automatically, it must first learn to interpret images that are paired with captions. Image caption generation models are mostly based on an encoder-decoder approach, like a language translation model. A language translation model uses a Recurrent Neural Network (RNN) for both encoding and decoding, whereas an image caption generation model uses a Convolutional Neural Network (CNN) for encoding and an RNN for decoding. In this paper, the machine takes an image as input and produces a sequence of words (a caption) as output.
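To make the encoder-decoder pipeline concrete, the following is a minimal sketch in PyTorch, assuming a ResNet-18 as the CNN encoder and an LSTM as the RNN decoder; the layer sizes, vocabulary size, and model choices are illustrative assumptions, not the exact configuration used in this paper.

```python
# Minimal CNN-encoder / RNN-decoder caption model sketch (illustrative only).
# Assumes PyTorch and torchvision; layer sizes and vocab_size are hypothetical.
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """Encode an image into a fixed-length feature vector with a CNN."""
    def __init__(self, embed_size):
        super().__init__()
        resnet = models.resnet18(weights=None)  # pretrained weights optional
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop classifier
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):
        features = self.backbone(images).flatten(1)  # (batch, 512)
        return self.fc(features)                     # (batch, embed_size)

class DecoderRNN(nn.Module):
    """Decode the image feature vector into a word sequence with an LSTM."""
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Prepend the image feature as the first "token" of the sequence.
        embeddings = self.embed(captions)                        # (batch, T, embed)
        inputs = torch.cat([features.unsqueeze(1), embeddings], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)                                   # word scores per step

# Hypothetical usage: one 224x224 RGB image, a 10-word caption, vocab of 5000.
encoder = EncoderCNN(embed_size=256)
decoder = DecoderRNN(embed_size=256, hidden_size=512, vocab_size=5000)
image = torch.randn(1, 3, 224, 224)
caption = torch.randint(0, 5000, (1, 10))
scores = decoder(encoder(image), caption)
print(scores.shape)  # (1, 11, 5000): a vocabulary score per generated position
```

At inference time, the decoder would instead generate words one at a time, feeding each predicted word back in as the next input until an end-of-sequence token is produced.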